What Is Text Mining and How Does it Work? (+6 Visuals)

Devin Pickell  |  July 9, 2019

Take a few minutes right now and think about all the textual information your brain processes during a standard workday.

Emails, messaging apps, landing pages, call transcripts, customer reviews, and other sources contain so much text, it can be overwhelming to consume this information in a constructive way.

But in the age of digital transformation, there’s an increasingly popular way for businesses to capture massive quantities of text from apps and the Web so it can be analyzed. It’s called text mining, also referred to as text analysis.

Here’s a quick breakdown of text mining and how it’s used today.

What is text mining?

Text mining is the process of extracting high-quality information from text on apps and throughout the Web. It is part of the larger umbrella of advanced analytics.

How text mining works

In simple terms, text mining works by importing textual data from a variety of sources. Then, natural language processing (NLP) is used to pull insight from the text. Depending on the type of NLP, this insight can vary – but we’ll get into that later.

After NLP, data visualization helps the human user understand what patterns, trends, and general insights were pulled from the text. That’s text mining and analysis in a nutshell.
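To make the three stages concrete, here is a minimal Python sketch of import, analysis, and visualization. Everything in it is an assumption for illustration: the sample reviews, the tiny stop-word list, and the function names are hypothetical, and plain word counting stands in for real NLP.

```python
import re
from collections import Counter

# Hypothetical stop words; a real pipeline would use a much longer list.
STOP_WORDS = {"is", "the", "a", "and"}

def import_texts():
    """Stage 1: stand-in for pulling text from apps, files, or the Web."""
    return ["Battery died fast", "Great battery, great price", "Battery is fine"]

def count_terms(texts):
    """Stage 2: word counting as a very rough stand-in for NLP."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

def render_bars(counts, top=2):
    """Stage 3: a text-only bar chart of the most frequent terms."""
    return [f"{word:<8} {'#' * n}" for word, n in counts.most_common(top)]

bars = render_bars(count_terms(import_texts()))
print("\n".join(bars))
```

In this toy data, "battery" gets the longest bar because it appears in all three reviews. A real tool would swap each stage for something heavier, but the shape of the flow stays the same.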

Text mining is complex, but you can now understand why it has become such an important part of how we analyze text today. Now it’s time to look deeper into some of the features of text mining.

Features of text mining

There are many components of mining and analyzing text data, but it all starts with information retrieval.

Information retrieval

You can’t analyze text without retrieving it in the first place, which is why information retrieval is the essential preliminary step to text mining.

Information can come from many sources, and it all depends on the objective you’re trying to achieve with text mining. For example, social media is often a hot target for information retrieval during election season to measure how social media users feel about politicians. Databases and internal systems are common sources for interpreting customer and employee sentiment.

After text is retrieved, it’s time to begin structuring it.

Text structuring

One of the main reasons text mining can be difficult is that it attempts to pull insight from (mostly) unstructured data. Here’s what that means.

Unstructured data includes emails, social media posts, comments, reviews, subjective survey results, news articles, and other human-written text. This data is unstructured because humans don’t write in ways that computers can easily parse. We use slang, idioms, varied sentence structures, and informal phrasing.

Machines prefer structured data, like what you’ll find in databases and spreadsheets. So, text needs to be structured after it’s retrieved.
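As a rough illustration of what "structuring" can mean, the Python sketch below turns one raw sentence into a record with fixed fields, closer to a spreadsheet row than to free text. The field names and the `structure_text` helper are hypothetical, and real tools do far more than split words.

```python
import re
from collections import Counter

def structure_text(doc_id, text):
    """Turn one raw string into a structured record (a plain dict)."""
    # Lowercase the text and keep runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "doc_id": doc_id,
        "n_tokens": len(tokens),
        "tokens": tokens,
        "term_counts": Counter(tokens),
    }

record = structure_text("review-1", "Loved it! Shipping was slow, but the phone is great.")
print(record["doc_id"], record["n_tokens"], record["tokens"][:3])
```

Once every document has the same fields, it can be stored, filtered, and counted like any other structured data.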

Rule-based or statistical NLP is often used to break down text. This is used for part of speech tagging, syntactic parsing, and other types of linguistic analysis.

Part of speech tagging example

[Image: part of speech tagging. Source: NLP Course, Sapienza University of Rome]
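For a feel of the rule-based approach, here is a toy part of speech tagger in Python. The mini-lexicon, the suffix rules, and the tag names are all assumptions made up for this sketch; production taggers rely on large dictionaries or statistical models instead.

```python
import re

# Hypothetical mini-lexicon of common words with known tags.
LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "is": "VERB", "was": "VERB",
    "on": "ADP", "in": "ADP",
    "quickly": "ADV",
}

# Hypothetical suffix rules, checked in order when a word is not in the lexicon.
SUFFIX_RULES = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV"), ("ous", "ADJ"), ("ful", "ADJ")]

def tag(sentence):
    """Return (word, tag) pairs using lexicon lookup, then suffix rules."""
    tagged = []
    for word in re.findall(r"[A-Za-z']+", sentence):
        lower = word.lower()
        if lower in LEXICON:
            tagged.append((word, LEXICON[lower]))
            continue
        for suffix, pos in SUFFIX_RULES:
            if lower.endswith(suffix):
                tagged.append((word, pos))
                break
        else:
            # No rule matched: fall back to the most common open class.
            tagged.append((word, "NOUN"))
    return tagged

print(tag("The customer quickly returned the phone"))
```

Rules this simple break easily. A word like "damaged" in "the damaged phone" would be tagged as a verb because of its "-ed" ending, which is one reason statistical taggers are so widely used.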

Syntactic parsing example

[Image: syntactic parsing example]

Named entity recognition

Another part of breaking down text data is using statistical techniques to identify named features like people, businesses, geographical locations, landmarks, well-known abbreviations, and so on.

[Image: named entity recognition example. Source: Mohammed Terry-Jack]
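The simplest way to see the idea in code is a lookup list, sometimes called a gazetteer. Note that this Python sketch swaps in rule-based matching rather than the statistical techniques described above, and the names and labels in `GAZETTEER` are hypothetical examples.

```python
import re

# Hypothetical gazetteer: known names mapped to entity labels.
GAZETTEER = {
    "G2 Crowd": "ORGANIZATION",
    "Chicago": "LOCATION",
    "Eiffel Tower": "LANDMARK",
}

def find_entities(text):
    """Return (name, label) pairs in the order they appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        # \b keeps "Chicago" from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, label))
    return [(name, label) for _, name, label in sorted(found)]

print(find_entities("G2 Crowd is based in Chicago, far from the Eiffel Tower."))
```

A lookup list only finds names it already knows. Statistical models earn their keep by recognizing entities they have never seen, based on the words around them.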

Disambiguation

To be ambiguous means to have more than one meaning. In human language, we can easily understand ambiguous terms in sentences when given the right context. Here’s an example with the word “Ford.”

[Image: disambiguation NLP example]

For a computer, however, understanding exactly which “Ford” we’re referring to can be difficult. Disambiguation helps machines interpret ambiguous terms using context clues.
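One simple way to use context clues is to count how many words in the sentence overlap with a list of cue words for each possible meaning. The Python sketch below does exactly that for "Ford"; the senses and cue words are hypothetical and chosen by hand, and real systems learn these associations from data.

```python
import re

# Hypothetical cue words for each sense of "Ford".
SENSES = {
    "Ford (car maker)": {"car", "truck", "drive", "engine", "dealership", "mustang"},
    "Ford (U.S. president)": {"president", "gerald", "nixon", "elected", "office"},
    "ford (river crossing)": {"river", "stream", "cross", "shallow", "water"},
}

def disambiguate(sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(context & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue words there is nothing to go on.
    return best if scores[best] > 0 else None

print(disambiguate("I looked at a Ford truck at the dealership"))
print(disambiguate("Ford became president after Nixon resigned"))
```

When no cue word appears at all, the function returns `None` rather than guessing, which mirrors how hard the task is without context.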

Coreference resolution

Higher-level NLP will incorporate coreference resolution, which is tasked with finding all expressions in a set of text that refer to the same person or thing. If that sounds confusing, take this example provided by Stanford:

[Image: coreference resolution example in NLP. Source: Stanford Natural Language Processing Group]

Coreference resolution helps a system keep track of who or what is being discussed across longer, more complex sentences.
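To show the basic idea, the Python sketch below links each pronoun to the most recently mentioned name. This "nearest name" rule is a deliberately naive assumption for illustration, not how Stanford's or any other production system works, and the list of names is supplied by hand.

```python
import re

# Pronouns this toy resolver knows about ("it" is left out on purpose).
PRONOUNS = {"he", "she", "him", "her", "his", "they", "them", "their"}

def resolve_pronouns(text, names):
    """Link each pronoun to the nearest preceding name from `names`."""
    links = []
    last_name = None
    for token in re.findall(r"[A-Za-z]+", text):
        if token in names:
            last_name = token
        elif token.lower() in PRONOUNS and last_name is not None:
            links.append((token, last_name))
    return links

text = "Maria opened the report. She sent it to Omar, and he replied."
print(resolve_pronouns(text, {"Maria", "Omar"}))
```

The rule fails as soon as a pronoun refers back past a more recent name, which is why real coreference systems weigh grammar, gender, number, and meaning together.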

Key-phrase extraction

Most pieces of content on the Web contain structure and a set of keywords/phrases that summarize the overall theme. Key-phrase extraction in text mining helps unveil these patterns and themes using either supervised or unsupervised learning.

Supervised learning uses training data, labels, and tags from many pieces of content to learn the relationship between certain keywords and text.

Unsupervised learning uses no training data. Instead, this method attempts to find naturally occurring patterns and themes between keywords and text.
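As a sketch of the unsupervised route, the Python example below scores words with TF-IDF: a word ranks high when it is frequent in one document but rare across the others. The three sample reviews, the stop-word list, and the function names are assumptions for illustration, and the sketch ranks single words rather than multi-word phrases.

```python
import math
import re
from collections import Counter

# Hypothetical stop words that carry little meaning on their own.
STOP_WORDS = {"the", "is", "and", "but", "was", "a", "an", "of", "to"}

def tokenize(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def top_keywords(docs, doc_index, k=3):
    """Rank the words of one document by TF-IDF against the whole set."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {
        word: (count / len(target)) * math.log(len(docs) / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

docs = [
    "The battery life is great and the battery charges fast",
    "The screen is great but the speaker is quiet",
    "Shipping was fast and the packaging was great",
]
print(top_keywords(docs, 0))
```

Notice that "great" never makes the list for the first review. It appears in all three documents, so it says nothing specific about any one of them.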

Sentiment analysis

One of the most important features of text mining is sentiment analysis. This feature analyzes text from articles, social media, surveys, reviews – essentially any source where suggestions, comments, or feedback are given.

A sentiment score is then applied, determining if the text is positive, negative, happy, sad, or neutral. This gives the business or agency an idea of how users feel about a particular topic.
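A bare-bones way to compute such a score is to count positive and negative words. The Python sketch below does this with two tiny hand-picked word lists; the lists and function names are hypothetical, and commercial tools typically use trained models or much larger lexicons.

```python
import re

# Hypothetical sentiment word lists, far smaller than any real lexicon.
POSITIVE = {"great", "love", "loved", "excellent", "happy", "fast"}
NEGATIVE = {"bad", "slow", "terrible", "hate", "broken", "sad"}

def sentiment_score(text):
    """Positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives

def sentiment_label(text):
    """Map the numeric score to a positive, negative, or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["I love the screen, it is great",
               "Terrible battery and slow shipping",
               "The box arrived on Tuesday"]:
    print(sentiment_label(review), "|", review)
```

Word counting misses negation and sarcasm. "Not great" would still score as positive here, so treat this as a starting point for understanding the concept rather than a usable classifier.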

[Image: distribution of sentiments]

Sentiment can be used in many ways. Businesses commonly look at sentiment to see how users feel about a product or service launch. Brands examine sentiment to see which topics are hot and which are not – this can inform future content creation. Even some political campaigns study sentiment to see the popularity of their candidate.

A quick recap of text mining

Text mining uses NLP and other methods to break down the nuances of human language since we talk and write in unstructured ways. The results of mining are analyzed in many ways to uncover patterns and trends in text. This is important, considering an enormous amount of today’s unstructured data is actually text.

Interested in learning more about mining and advanced analytics? Check out our complete guide on data mining to see how it uncovers patterns and trends.

Devin Pickell
Author

Devin is a Content Marketing Specialist at G2 Crowd writing about data, analytics, and digital marketing. Prior to G2, he helped scale early-stage startups out of Chicago's booming tech scene. Outside of work, he enjoys watching his beloved Cubs, playing baseball, and gaming. (he/him/his)