When we think of the word “mining,” we often associate it with searching for gold, coal, ore, and other valuable resources. Data mining, however, is much different.
Data Mining Definition
Defined as “knowledge discovery within a database,” data mining is about sifting through large sets of data to uncover patterns, trends, and other truths that may not have been previously visible. The results of data mining can then be analyzed and tested by businesses. In short, data mining is about finding the needle in a haystack.
Data mining searches for patterns and trends in an intelligent way. This means that data mining often utilizes technologies like machine learning software and artificial intelligence (AI), as well as statistical models.
If it sounds confusing, that’s because data mining is a more mathematical and scientific process than most ways of interpreting data. But data mining is a crucial step in helping businesses find the value that lies within all their data.
What is Data Mining
Data mining is an important subset of data science. Contrary to its name, none of data mining is about searching for data itself. As a matter of fact, all the data that’s going to be “mined” will need to be gathered in a data warehouse ahead of time. Perhaps “knowledge mining” would be a better phrase than data mining, but let’s move on.
Different data mining techniques can be applied to everything from descriptive to predictive analyses. For example, data mining can search through a business’s historical data to see which customers are buying which products at certain times of the year, and map out ways to segment those customers.
Why would a business require customer segmentation? To test new ways of targeting its sales and marketing campaigns – which could lead to higher profits, but also point toward a potential trend or two.
Data mining is essential for finding relationships within large and varied sets of big data. This is why everything from business intelligence software to big data analytics programs uses some form of data mining.
Data Mining Techniques
Because big data is a seemingly random pool of facts and details, a variety of data mining techniques are required to reveal different insights. Our example from earlier explains how data mining can segment customers, but data mining can also determine customer loyalty, identify risks, build predictive models, and much more.
One data mining technique is called clustering analysis, which essentially groups large amounts of data together based on their similarities. This mockup below shows what a clustering analysis may look like.
Data points that look scattered on a chart can actually be grouped in meaningful ways through clustering analysis. This analysis can also act as a preprocessing step, which means the data is organized so that other techniques can be applied more easily.
What is it used for? There are a few ways to draw knowledge out of a clustering analysis. Insurance companies can identify groups of policy holders with high average claims. Seismologists can see the origin of earthquake activity and the strength of each earthquake, then apply that insight for designing evacuation routes.
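To make the grouping idea concrete, here is a minimal sketch of k-means, one common clustering algorithm. The policy holder numbers are invented for illustration, and starting from the first k points is an assumption made only to keep the result reproducible.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters using plain k-means."""
    # Assumption: seed the centroids with the first k points so the result
    # is reproducible; real implementations usually pick smarter starts.
    centroids = list(points[:k])
    assignments = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Step 2: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids

# Invented policy holders: (claims per year, average claim in $1,000s).
policy_holders = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
assignments, centroids = k_means(policy_holders, k=2)
print(assignments)  # prints [0, 1, 0, 1, 0, 1]
```

In this toy data, cluster 1 is the group with frequent, expensive claims, which is the kind of group an insurer might want to look at more closely.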
Anomaly detection, also known as outlier detection, does perhaps the opposite of clustering. Instead of searching for large groups of data that could be grouped together, anomaly detection looks for data points that are rare and outside an established group or average.
Because data is noisy, an anomaly doesn’t necessarily point toward a trend. Instead, a data point that goes against the grain could indicate that something abnormal is going on and requires further analysis.
What is it used for? Anomaly detection is most commonly used in fraud detection. For example, anomaly detection can identify suspicious credit card activity and trigger a response. There is usually some level of machine learning involved in this case.
In an age where cyberattacks are more robust and common than ever, anomaly detection helps identify breaches on websites so they can be quickly resolved. This is called intrusion detection.
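As a rough illustration of flagging outliers, the sketch below uses a modified z-score built on the median and the median absolute deviation (MAD). This is only one simple statistical approach, not what any particular fraud system uses. The transaction amounts are made up, and the 3.5 cutoff is a commonly used rule of thumb that is assumed here.

```python
import statistics

def find_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Invented card transactions in dollars; one is far outside the usual range.
amounts = [25.0, 30.5, 22.0, 27.5, 31.0, 24.0, 950.0, 28.0]
print(find_anomalies(amounts))  # prints [950.0]
```

The median is used instead of the mean because a single extreme value can drag the mean and standard deviation far enough to hide itself in a small sample.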
Association Rule Mining
Looking for groups and outliers is one way to mine for knowledge, but another technique called association rule mining looks at how one variable relates to another.
The insight from association rule mining can help businesses identify potential correlations. For example, if event A occurs, then event B is likely to follow. If you’ve ever been suggested products on an e-commerce site based on what’s in your cart, then you’ve seen association rule mining at work.
What is it used for? Walmart applied this data mining technique in 2004 ahead of Hurricane Frances. By mining transaction and inventory data, analysts discovered that strawberry Pop-Tart sales were seven times higher than normal right before a hurricane hit. Beer was also revealed as the top-selling pre-hurricane item. With this information at hand, Walmart was sure to stock up.
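The “if A, then B” idea can be sketched by counting how often pairs of items show up together. The baskets below are a toy data set invented for illustration, not Walmart’s data, and the `pair_rules` helper and its thresholds are hypothetical names and values chosen for this example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Find 'if A then B' rules between pairs of items."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            # Confidence: of the baskets with the antecedent, how many also
            # contain the consequent?
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                # Lift above 1 means the pair occurs together more often
                # than chance alone would suggest.
                lift = confidence / (item_counts[consequent] / n)
                rules.append((antecedent, consequent, support, confidence, lift))
    return rules

# Invented shopping baskets.
baskets = [
    {"flashlight", "pop-tarts", "beer"},
    {"pop-tarts", "beer"},
    {"flashlight", "batteries"},
    {"pop-tarts", "beer", "batteries"},
    {"milk", "bread"},
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence, lift in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

Production tools typically use algorithms such as Apriori or FP-Growth to handle larger item sets, but the support, confidence, and lift measures are the same.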
If a business is looking to make a prediction based on the effect one variable has on others, it may turn to a data mining technique called regression analysis.
On the surface, data is chaotic. There’s a lot of trial and error involved when examining the relationship between one set of data and another – especially when a business is trying to figure out event probabilities and make predictions. Regression analysis can steer these predictions in the right direction.
What is it used for? An example of regression analysis in the healthcare industry is examining the effects body mass index, or BMI, has on other variables.
The example above is called a linear regression analysis, which means a straight line can be drawn to show how one variable relates to another. In this case, we see that higher total cholesterol tends to go with a higher BMI, and vice versa. Keep in mind that this shows an association, not proof that one causes the other.
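Here is a minimal sketch of how such a straight line can be fitted with ordinary least squares. The cholesterol and BMI values are invented, and chosen to lie exactly on a line so the result is easy to check by hand. Real measurements would scatter around the line.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented sample: total cholesterol (mg/dL) and BMI for five people.
cholesterol = [160, 180, 200, 220, 240]
bmi = [21, 23, 25, 27, 29]

slope, intercept = fit_line(cholesterol, bmi)
predicted_bmi = slope * 210 + intercept  # estimate for a cholesterol of 210
print(round(slope, 3), round(intercept, 3), round(predicted_bmi, 1))
```

With these numbers the fitted slope is 0.1 and the intercept is 5, so the line estimates a BMI of about 26 for a cholesterol reading of 210.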
Decision Tree Analysis
One of the more visual data mining techniques is called decision tree analysis, and it is a popular method for important decision making.
There are two types of decision tree analyses. One of them is called classification, which is what you see in the example above determining whether or not a passenger would have survived on the Titanic. Classification is logic-based, using a variety of if/then or yes/no conditions until all relevant data is mapped out.
The other type of decision tree is called regression, which is used when the target is a numerical value. For example, regression could be used to estimate a house’s value. Both types of decision tree can be run through machine learning programs as well.
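To show how a classification tree picks its yes/no questions, the sketch below finds the single best split for a tiny data set using Gini impurity, one common splitting measure. The rows are invented and are not actual Titanic records, and the field names are hypothetical. A full tree would repeat this step on each branch, and a regression tree would score splits by variance instead.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of positive labels
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return the (feature, threshold) that gives the purest two groups."""
    best, best_score = None, float("inf")
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            # Weighted average impurity of the two sides.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

# Invented passengers; label 1 means survived, 0 means did not.
passengers = [
    {"is_female": 1, "age": 29},
    {"is_female": 1, "age": 40},
    {"is_female": 0, "age": 35},
    {"is_female": 0, "age": 50},
    {"is_female": 1, "age": 8},
    {"is_female": 0, "age": 22},
]
survived = [1, 1, 0, 0, 1, 0]

print(best_split(passengers, survived))  # prints ('is_female', 0)
```

The result reads as the first question in the tree: “is `is_female` less than or equal to 0?” In this toy data that one question separates the two outcomes completely.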
Big Data and Data Mining
Data mining has traditionally worked with structured data, and for the most part, this still holds true today. Structured data is essentially data that fits neatly within fixed fields and relational databases. It is the type of data you might see in a spreadsheet format. Structured data is organized so that analytics programs can digest it and produce results quickly.
Unfortunately, an estimated 80 percent of big data is actually unstructured data, or data that cannot be analyzed in traditional ways. This has posed a challenge to data mining.
Unstructured data is everything from video content and voicemails to emails and text messages. As you may have guessed, there’s a ton of value that lies within this data. But because big data is expanding at a pace that’s faster than we can keep up with, few businesses are able to harness unstructured data.
Most unstructured data, however, is actually text-heavy. While humans don’t talk or type in ways that machines can readily parse, a type of data mining called text mining is being used to make sense of this data.
Text mining, or text analysis, software is still in its relative infancy, but the ways it extracts information from text-heavy unstructured data are pretty unique.
Text-heavy data will first need to be collected and formatted in a uniform way. Text is taken from everything from HTML and XML files to Word documents and PDF files. Embedded image files are then removed, as they add no value for text mining.
Next, all text that is considered “noise” will be eliminated. This consists of common words like “of,” “a,” and “the,” often called stop words. Words will also be reduced to their root forms. For example, words like “supporting” and “valued” will be reduced to “support” and “value.”
Words that are synonyms will be unified. Numerical values and percentages will be pulled and formatted in their own ways. Now, everything should be as close to structured data as possible. There’s your crash course on how text mining works.
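The cleanup steps above can be sketched in a few lines. The stop word list, suffix rules, and synonym map here are tiny, hypothetical stand-ins made up for this example. Real text mining tools use much larger dictionaries and proper stemmers.

```python
import re

# Tiny illustrative lists; real tools ship far larger ones.
STOP_WORDS = {"of", "a", "the", "and", "to", "in", "is"}
SYNONYMS = {"purchase": "buy"}  # map each synonym to one shared term
SUFFIX_RULES = [("ing", ""), ("ued", "ue"), ("ed", ""), ("s", "")]

def crude_stem(word):
    """Strip a common suffix; a rough stand-in for a real stemmer."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

def preprocess(text):
    """Return (cleaned word tokens, numeric values) for a piece of text."""
    text = text.lower()
    # Pull numbers and percentages out so they can be handled separately.
    numbers = re.findall(r"\d+(?:\.\d+)?%?", text)
    words = re.findall(r"[a-z]+", text)
    tokens = []
    for word in words:
        if word in STOP_WORDS:
            continue  # drop "noise" words
        root = crude_stem(word)
        tokens.append(SYNONYMS.get(root, root))  # unify synonyms
    return tokens, numbers

tokens, numbers = preprocess("The customers valued supporting 25% of the purchases")
print(tokens)   # prints ['customer', 'value', 'support', 'buy']
print(numbers)  # prints ['25%']
```

The output is a list of normalized tokens plus a separate list of numeric values, which is much closer to the structured format that other data mining techniques expect.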
Future of Data Mining
Text mining is the here and now, but the future of data mining will focus on other forms of unstructured data. For example, data from images and videos can be mined for knowledge discovery. There are some frameworks now that focus on image, video, and audio mining, but they’re still in very early stages.
Semantic web mining will also be more prevalent, enabling researchers to find deeper meaning that’s hidden within data on the web. The semantic web is essentially an extension of the world wide web, where data on websites are structured and tagged in a way that’s easier for machines to read.
From big data to business intelligence, all of the data that businesses gather would serve no purpose without knowledge discovery. Data mining allows businesses to visualize patterns and trends in raw data that may not have been previously visible. Whichever insights are revealed can lead to more informed decision making, which is beneficial to both businesses and the customers they serve.