Brittany Kaiser, former director of business development for Cambridge Analytica, stated in Netflix’s The Great Hack that data is now more valuable than oil.
And just like oil, gold, ore, and other resources, there’s hidden value in data that needs to be mined and extracted. This process is referred to as data mining.
What is data mining?
Data mining is commonly referred to as knowledge discovery in databases (KDD). It’s about sifting through massive datasets to uncover patterns, trends, and other truths that aren’t initially visible. The results of data mining are then analyzed, tested, and applied. In short, data mining is about finding a needle in a haystack.
Data mining is conducted using machine learning algorithms and statistical methods. These methods help reduce ‘noise’ in databases to extract useful information.
So, why would a business concern itself with data mining?
Why is data mining important?
Data mining explores a business’s historical data during the data analysis process to review past performance and forecast future outcomes. This leads to faster, more efficient decision making.
For example, through data mining, a business may be able to see which customers are buying certain products at certain times of the year. This information can then be used to segment those customers. Customer segmentation is important for targeting sales and marketing campaigns, which could lead to higher profits, and it may also point toward a potential trend or two.
Now that you have some base knowledge of what data mining is and why it’s important, let’s take a look at some of the common techniques used today.
Data mining techniques
A variety of data mining techniques are often required to uncover insights that lie within big datasets. Our example from earlier explains how data mining can segment customers, but mining can also help determine customer loyalty, identify risks, build predictive models, and much more. Below, we look at each technique in more detail.
Common types of data mining techniques
- Clustering analysis
- Outlier detection
- Association rule mining
- Regression analysis
- Decision tree analysis
Clustering analysis
One data mining technique is called clustering analysis, which groups large quantities of data together based on their similarities. The mockup below shows what a clustering analysis may look like.
Data that looks scattered on a chart can be grouped in meaningful ways through clustering analysis. Clustering can also act as a preprocessing step, meaning the data is organized so that other techniques can be applied more easily.
What is it used for? There are a few ways to draw knowledge out of a clustering analysis. Insurance companies can identify groups of policyholders with high average claims. Seismologists can see the origin of earthquake activity and the strength of each earthquake, then apply that insight for designing evacuation routes.
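To make the grouping idea concrete, here is a minimal sketch of k-means, one common clustering algorithm (the article does not name a specific one, so this choice is an assumption). The policyholder records are made up for illustration, and the function names `assign` and `kmeans` are hypothetical.

```python
import math
import random


def assign(points, centroids):
    """Attach every point to the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of numbers (illustrative, not production code)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        updated = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the grouping is stable
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Made-up policyholders: (claims per year, average claim in $1,000s).
policyholders = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(policyholders, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

On this toy data the low-claim and high-claim policyholders end up in separate groups. Real datasets usually need feature scaling and some experimentation with the number of clusters.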
Outlier detection
Also known as anomaly detection, this data mining technique does roughly the opposite of clustering. Instead of searching for large groups of data that could be clustered together, outlier detection looks for data points that are rare and fall outside an established group or average.
Because data is naturally noisy, an anomaly doesn’t necessarily point toward a trend. Instead, data that goes against the grain could indicate that something abnormal is going on and requires further analysis.
What is it used for? Outlier detection is most commonly used in detecting fraudulent behavior. For example, outlier detection can identify suspicious credit card activity and trigger a response (such as an account freeze).
In an age where cyber-attacks are more sophisticated and common than ever, outlier detection helps identify breaches on websites so they can be quickly resolved. This is called intrusion detection.
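As a rough illustration of flagging values that fall outside the usual range, the sketch below scores each value by its distance from the median, scaled by the median absolute deviation (a "modified z-score"). The transaction amounts are invented, the function name `find_outliers` is hypothetical, and the 3.5 cutoff is a commonly used rule of thumb rather than anything the article prescribes.

```python
import statistics


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that a single extreme value cannot inflate.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to compare against, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Made-up card transactions in dollars; one of them is far from the rest.
amounts = [23.5, 41.0, 18.2, 36.9, 27.4, 30.1, 950.0, 25.8]
suspicious = find_outliers(amounts)
print(suspicious)
```

A flagged value is only a candidate for review. As the text notes, an anomaly signals that something may need further analysis, not that fraud has occurred.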
Association rule mining
Looking for groups and outliers are two ways to mine for knowledge, but another technique, association rule mining, looks at how one variable relates to another.
The insight from association rule mining can help businesses identify potential correlations. For example, if event A occurs, then event B is likely to follow. If you’ve ever been suggested products on an e-commerce site based on what’s in your cart, then you’ve seen association rule mining at work.
What is it used for? Walmart applied this data mining technique in 2004 during Hurricane Frances. By mining transaction and inventory data, analysts discovered that strawberry Pop-Tart sales were seven times higher than usual right before the hurricane hit. Beer was also revealed as the top-selling pre-hurricane item. With this information at hand, Walmart made sure to stock up.
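The "if event A occurs, then event B is likely to follow" idea is usually quantified with three numbers: support (how often A and B appear together), confidence (how often B appears when A does), and lift (how much more likely B is when A is present). The sketch below computes them for one rule. The shopping baskets are invented and the function name `rule_metrics` is hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    with_a = sum(1 for b in baskets if antecedent <= b)          # baskets containing A
    with_c = sum(1 for b in baskets if consequent <= b)          # baskets containing B
    with_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = with_both / n
    confidence = with_both / with_a if with_a else 0.0
    lift = confidence / (with_c / n) if with_c else 0.0
    return support, confidence, lift


# Made-up transactions; each basket is the set of items bought together.
baskets = [
    {"flashlight", "snacks", "water"},
    {"snacks", "soda"},
    {"flashlight", "batteries", "snacks"},
    {"milk", "bread"},
    {"flashlight", "water"},
]
support, confidence, lift = rule_metrics(baskets, {"flashlight"}, {"snacks"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items are bought together more often than chance alone would explain. Full association rule mining searches over many candidate rules instead of scoring a single one.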
Regression analysis
If a business is looking to make a prediction based on the effect one variable has on others, it may turn to a data mining technique called regression analysis.
On the surface, data is chaotic. There’s a lot of trial and error involved when examining the relationship between one set of data and another – especially when a business is trying to figure out event probabilities and make predictions. Regression analysis can steer these predictions in the right direction.
What is it used for? An example of regression analysis in the healthcare industry is examining the effects that body mass index, or BMI, has on other variables.
The example above is called a linear regression analysis, which means a straight line can be drawn to show how two variables relate to each other. In this case, we see that higher total cholesterol tends to go along with a higher BMI, and vice versa.
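Drawing that straight line amounts to an ordinary least-squares fit. The sketch below computes the slope and intercept by hand. The BMI and cholesterol readings are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up readings: body mass index and total cholesterol (mg/dL).
bmi = [21.0, 24.5, 27.0, 30.5, 33.0]
cholesterol = [170, 185, 200, 215, 230]
slope, intercept = fit_line(bmi, cholesterol)

# Use the fitted line to estimate cholesterol for a BMI not in the data.
estimate = slope * 28.0 + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} estimate={estimate:.1f}")
```

A positive slope means the two variables rise together in this sample. It describes an association in the data and does not by itself show that one variable causes the other.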
Decision tree analysis
One of the more visual data mining techniques is called decision tree analysis, and it is a popular method for important decision making.
There are two types of decision trees. One of them is called classification, which is what you see in the example above determining whether or not a passenger would have survived on the Titanic. Classification is logic-based, using a variety of if/then or yes/no conditions until all relevant data is mapped out.
The other type is called regression, which is used when the target is a numerical value. For example, a regression tree could be used to estimate a house’s value. Both types of decision tree can be built by machine learning programs as well.
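A classification tree is built by repeatedly choosing the yes/no question that best separates the labels. The sketch below finds a single such question (the root of a tree) using Gini impurity, one common splitting criterion. The records are invented and are not real Titanic data, and the names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels agree, larger when they are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(v) / n) ** 2 for v in set(labels))


def best_split(rows, labels):
    """Find the question 'feature <= threshold?' with the lowest weighted impurity."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            yes = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            no = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not yes or not no:
                continue  # a question that separates nothing is useless
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Made-up passenger records: (age, ticket fare) with a survived label.
rows = [(22, 7.0), (38, 71.0), (26, 8.0), (35, 53.0), (54, 52.0), (40, 21.0)]
labels = ["no", "yes", "no", "yes", "yes", "no"]
feature, threshold, score = best_split(rows, labels)
print(f"split on feature {feature} at <= {threshold} (impurity {score})")
```

A full decision tree repeats this search inside each resulting branch until the branches are pure or a depth limit is reached. A regression tree follows the same pattern but scores splits by how much they reduce the variance of a numeric target.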
Text mining
Text mining, also called text analysis, is an extension of data mining that uses natural language processing (NLP) to extract information from text-heavy unstructured data. Here’s an example of how text mining works.
Text-heavy data first needs to be collected and formatted in a uniform way. Text is taken from everything from HTML and XML files to Word documents and PDF files. Embedded image files are then removed, as they add no value for text mining.
Next, all text that is considered “noise” will be eliminated. This includes common words (often called stop words) like “of,” “a,” “the,” etc. Words will also be reduced to their root forms. For example, words like “supporting” and “valued” will be reduced to “support” and “value.”
Words that are synonyms will be unified. Numerical values and percentages will be pulled and formatted in their own ways. Phrases, key terms, sentence structures, and other nuances of the human language will be broken down as well. Now, everything should be as close to structured data as possible.
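The cleanup steps above (removing noise words and reducing words to a root form) can be sketched in a few lines. The stop-word list and suffix rules here are tiny, hypothetical stand-ins for what a real NLP library provides, and the function names are made up for this example.

```python
import re

STOP_WORDS = {"of", "a", "the", "and", "is", "to", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")                              # crude stemming rules


def stem(word):
    """Strip one common suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase the text, split it into words, drop stop words, and stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The supporting cast valued the reviews of a film")
print(tokens)
```

Note that this crude stemmer turns “valued” into “valu” instead of “value”. Suffix-stripping stemmers often produce such truncated roots, while a dictionary-based lemmatizer would return the full word.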
This is a simplified explanation of text mining. For a more detailed breakdown, check out our quick guide on text mining and its features.
Future of data mining
Text mining is the here and now, but the future of data mining will focus on other forms of unstructured data as well. For example, data from images and videos can be mined for knowledge discovery. There are some frameworks now that focus on image, video, and audio mining, but they’re still in very early stages.
Semantic Web mining will also be more prevalent, enabling researchers to find deeper meaning that’s hidden within data on the Web. The Semantic Web is essentially an extension of the World Wide Web where data on websites is structured and tagged in a way that’s easier for machines to read.
From business intelligence to big data analytics, all of the data that businesses gather would serve no purpose without knowledge discovery.
Data mining allows businesses to visualize patterns and trends in raw data that may not be initially visible. The insights that are revealed can lead to faster, more informed decision making. This is beneficial to both businesses and the customers they serve.