Take a few minutes right now and think about all the textual information your brain processes during a standard workday.
Emails, messaging apps, landing pages, call transcripts, customer reviews, and other sources contain so much text that it can be overwhelming to consume this information in a constructive way.
But in the age of digital transformation, there’s an increasingly popular way for businesses to capture massive quantities of text from apps and the Web so it can be analyzed. It’s called text mining, also referred to as text analysis.
Here’s a quick breakdown of text mining and how it’s used today.
What is text mining?
Text mining is the process of extracting high-quality information from text on apps and throughout the Web. It is part of the larger umbrella of advanced analytics.
How text mining works
In simple terms, text mining works by importing textual data from a variety of sources. Then, natural language processing (NLP) is used to pull insight from the text. Depending on the type of NLP, this insight can vary – but we’ll get into that later.
After NLP, a data visualization helps the human user understand what kinds of patterns, trends, and general insights were pulled from the text. That’s text mining and analysis in a nutshell.
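To make those three steps concrete, here is a minimal sketch of the flow: import text, process it, and visualize the result. It assumes the "import" is just a list of strings already in memory, and it stands in a simple word count for real NLP. The `mine` and `bar_chart` helpers are hypothetical names made up for this example.

```python
import re
from collections import Counter

def mine(texts, top_n=3):
    """Count the most frequent words across a list of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, split into words, and skip very short words.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) > 3)
    return counts.most_common(top_n)

def bar_chart(pairs):
    """Render (word, count) pairs as a text-only bar chart."""
    return "\n".join(f"{word:<10}{'#' * n}" for word, n in pairs)

# Step 1: "imported" text (hard-coded here for illustration).
reviews = [
    "Battery died fast",
    "Battery is great",
    "Great screen, weak battery",
]

# Steps 2 and 3: process the text, then visualize the result.
print(bar_chart(mine(reviews, top_n=2)))
```

A real tool would swap the word count for the NLP techniques covered later and the text chart for a proper dashboard, but the shape of the pipeline stays the same.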
Why do we need text mining?
Text mining is used in business to gauge the sentiment of customers or summarize survey results. It’s used in politics to measure preference for certain candidates. Mining is even used by intelligence agencies to identify areas of cyber-crime.
Text mining is complex, but you can now understand why it has become such an important part of how we analyze text today. Now it’s time to look deeper into some of the features of text mining.
Features of text mining
There are many components of mining and analyzing text data, but it all starts with information retrieval.
You can’t analyze text without retrieving it in the first place, which is why information retrieval is the essential preliminary step to text mining.
Information can come from many sources, and it all depends on the objective you’re trying to achieve with text mining. For example, social media is often a hot target for information retrieval during election season to measure how social media users feel about politicians. Databases and internal systems are common sources for interpreting customer and employee sentiment.
After text is retrieved, it’s time to begin structuring it.
One of the main reasons text mining can be difficult is that it attempts to pull insight from (mostly) unstructured data. Here’s what that means.
Unstructured data includes emails, social media posts, comments, reviews, subjective survey results, news articles, and other human-written text. This data is unstructured because humans don’t write in ways that are easy for computers to understand. We use different slang, phrases, sentence flows, and informalities.
Machines prefer structured data, like what you’ll find in databases and spreadsheets. So, text needs to be structured after it’s retrieved.
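As a rough illustration of what "structuring" can mean, the sketch below turns one free-form review into a row of fields that a spreadsheet or database could hold. The field names and the `structure_text` helper are hypothetical, chosen only for this example.

```python
import re
from collections import Counter

def structure_text(doc_id, text):
    """Turn one raw text snippet into a flat, table-like record."""
    # Lowercase and split into word tokens; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    return {
        "doc_id": doc_id,
        "n_tokens": len(tokens),
        "n_unique": len(counts),
        "top_token": counts.most_common(1)[0][0] if counts else None,
        "tokens": tokens,
    }

record = structure_text(1, "Great phone, great battery. Shipping was slow!")
print(record["n_tokens"], record["n_unique"], record["top_token"])
# prints: 7 6 great
```

Once every document looks like this, the rows can be sorted, filtered, and aggregated like any other structured data.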
Rule-based or statistical NLP is often used to break down text. This is used for part-of-speech tagging, syntactic parsing, and other types of linguistic analysis.
Part of speech tagging example
Source: NLP Course, Sapienza University of Rome
Syntactic parsing example
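To make part-of-speech tagging a bit more tangible, here is a minimal rule-based sketch. It assumes a tiny hand-made word list plus a few suffix rules, all invented for this example. Real taggers use far larger resources or statistical models, so treat this as a toy.

```python
import re

# Tiny hand-made lexicon (an assumption for this sketch, not a real resource).
LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "is": "VERB", "was": "VERB", "loves": "VERB",
    "new": "ADJ",
}

def tag(sentence):
    """Assign a coarse part-of-speech tag to each word using simple rules."""
    tagged = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        lower = word.lower()
        if lower in LEXICON:
            pos = LEXICON[lower]        # Rule 1: known word
        elif lower.endswith("ly"):
            pos = "ADV"                 # Rule 2: "-ly" words are often adverbs
        elif lower.endswith(("ing", "ed")):
            pos = "VERB"                # Rule 3: common verb endings
        else:
            pos = "NOUN"                # Fallback guess
        tagged.append((word, pos))
    return tagged

print(tag("The customer loves the new phone"))
```

The fallback rule will get plenty of words wrong, which is exactly why statistical approaches that learn from labeled text are so common.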
Named entity recognition
Another part of breaking down text data is using statistical techniques to identify named entities such as people, businesses, geographical locations, landmarks, well-known abbreviations, and so on.
Source: Mohammed Terry-Jack
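Production systems usually rely on statistical models for this. The sketch below swaps in a much simpler technique, a dictionary lookup (often called a gazetteer), just to show the kind of output named entity recognition produces. The `GAZETTEER` entries and the `find_entities` helper are assumptions made up for illustration.

```python
import re

# Hypothetical gazetteer: a small lookup table of known names and their types.
GAZETTEER = {
    "Chicago": "LOCATION",
    "Wrigley Field": "LANDMARK",
    "G2 Crowd": "ORGANIZATION",
    "Stanford Natural Language Processing Group": "ORGANIZATION",
}

def find_entities(text):
    """Return (name, label) pairs for gazetteer names found in the text."""
    found = []
    taken = set()  # character positions already claimed by a match
    # Try longer names first so they win over shorter overlapping ones.
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                taken.update(span)
                found.append((m.start(), name, GAZETTEER[name]))
    # Report entities in the order they appear in the text.
    return [(name, label) for _, name, label in sorted(found)]

print(find_entities("G2 Crowd is based in Chicago, not far from Wrigley Field."))
```

A lookup like this only finds names it already knows, which is the main reason statistical techniques are preferred for text at scale.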
To be ambiguous means to have more than one meaning. In human language, we can easily understand ambiguous terms in sentences when given the right context. Here’s an example with the word “Ford.”
For a computer, however, understanding exactly which “Ford” we’re referring to can be difficult. Disambiguation helps machines decipher text with context clues.
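One simple way to picture context clues at work is to count how many clue words for each possible meaning show up in the sentence. The sketch below does exactly that. The `SENSES` table and its clue words are hypothetical, hand-picked for this example rather than taken from any real resource.

```python
import re

# Hypothetical sense inventory: each meaning of "Ford" with a few clue words.
SENSES = {
    "Ford (car maker)": {"car", "truck", "drive", "engine", "dealership"},
    "Harrison Ford (actor)": {"actor", "movie", "film", "starred", "role"},
    "Gerald Ford (president)": {"president", "election", "congress", "pardon"},
}

def disambiguate(sentence):
    """Pick the sense whose clue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & clues) for sense, clues in SENSES.items()}
    best = max(scores, key=scores.get)
    # With no clue words at all, admit that we cannot decide.
    return best if scores[best] > 0 else None

print(disambiguate("I took the Ford for a drive and the engine sounded great"))
# prints: Ford (car maker)
print(disambiguate("Ford starred in the movie"))
# prints: Harrison Ford (actor)
```

When no clue words appear, the function returns `None` instead of guessing, which mirrors how hard the task is without context.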
Higher-level NLP will incorporate coreference resolution, which is tasked with finding all expressions in a body of text that refer to the same person or thing. If that sounds confusing, take this example provided by Stanford:
Source: Stanford Natural Language Processing Group
Coreference resolution helps break up longer, more complex sentence structures.
Most pieces of content on the Web contain structure and a set of keywords/phrases that summarize the overall theme. Key-phrase extraction in text mining helps unveil these patterns and themes using either supervised or unsupervised learning.
Supervised learning uses training data, labels, and tags from many pieces of content to learn the relationship between certain keywords and text.
Unsupervised learning uses no training data. Instead, this method attempts to find naturally occurring patterns and themes between keywords and text.
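As a small illustration of the unsupervised route, the sketch below scores words with TF-IDF, a common weighting that favors words frequent in one document but rare across the rest. No labels or training data are involved. The stopword list and the `top_keywords` helper are assumptions for this example, and it extracts single keywords rather than multi-word phrases.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "in", "but"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def top_keywords(docs, k=2):
    """Return the k highest-scoring TF-IDF words for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {
            w: (count / len(toks)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:k])
    return results

docs = [
    "The battery life is great and the battery charges fast",
    "The screen is great but the screen scratches easily",
    "Shipping was slow and the shipping box was damaged",
]
print(top_keywords(docs, k=1))
# prints: [['battery'], ['screen'], ['shipping']]
```

Notice that "great" appears in two of the three reviews, so it scores lower than words that are specific to a single review.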
One of the most important features of text mining is sentiment analysis. This feature extracts text from articles, social media, surveys, reviews – essentially any source where suggestions, comments, or feedback is given.
A sentiment score is then applied, determining whether the text is positive, negative, happy, sad, or neutral. This gives the business or agency an idea of how users feel about a particular topic.
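Here is a minimal sketch of how a sentiment score could be computed, assuming a simple word-list approach. The `POSITIVE` and `NEGATIVE` sets are tiny, hand-picked examples. Real tools use much larger lexicons or trained models, and they handle negation and sarcasm, which this sketch ignores.

```python
import re

# Tiny illustrative word lists (assumptions for this sketch).
POSITIVE = {"great", "love", "excellent", "happy", "fast"}
NEGATIVE = {"bad", "slow", "broken", "hate", "terrible"}

def sentiment_score(text):
    """Score text from -1.0 (all negative words) to 1.0 (all positive words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in [
    "I love this phone, the camera is great",
    "Terrible battery and slow shipping",
    "The box arrived on Tuesday",
]:
    print(label(sentiment_score(review)), "|", review)
```

Averaging these scores across thousands of reviews or posts is what gives a business its overall read on how users feel.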
Sentiment can be used in many ways. Businesses commonly look at sentiment to see how users feel about a product or service launch. Brands examine sentiment to see which topics are hot and which are not – this can inform future content creation. Even some political campaigns study sentiment to see the popularity of their candidate.
A quick recap of text mining
Text mining uses NLP and other methods to break down the nuances of human language since we talk and write in unstructured ways. The results of mining are analyzed in many ways to uncover patterns and trends in text. This is important, considering an enormous amount of today’s unstructured data is actually text.
Interested in learning more about mining and advanced analytics? Check out our complete guide on data mining to see how it uncovers patterns and trends.
Devin is a Content Marketing Specialist at G2 Crowd writing about data, analytics, and digital marketing. Prior to G2, he helped scale early-stage startups out of Chicago's booming tech scene. Outside of work, he enjoys watching his beloved Cubs, playing baseball, and gaming. (he/him/his)