Real-world data is in most cases incomplete, noisy, and inconsistent.
With data generation growing exponentially and the number of heterogeneous data sources increasing, the probability of gathering anomalous or incorrect data is quite high.
But only high-quality data can lead to accurate models and, ultimately, accurate predictions. Hence, it’s crucial to process data for the best possible quality. This step of processing data is called data preprocessing, and it’s one of the essential steps in data science, machine learning, and artificial intelligence.
Data preprocessing is the process of transforming raw data into a useful, understandable format. Real-world or raw data usually has inconsistent formatting and human errors, and can also be incomplete. Data preprocessing resolves such issues and makes datasets more complete and easier to analyze.
It’s a crucial process that can affect the success of data mining and machine learning projects. It makes knowledge discovery from datasets faster and can ultimately affect the performance of machine learning models.
In other words, data preprocessing is transforming data into a form that computers can easily work on. It makes data analysis or visualization easier and increases the accuracy and speed of the machine learning algorithms that train on the data.
As you know, a dataset is a collection of data points. Data points are also called observations, data samples, events, and records.
Each sample is described using different characteristics, also known as features or attributes. Data preprocessing is essential to effectively build models with these features.
Numerous problems can arise while collecting data. You may have to aggregate data from different data sources, leading to mismatching data formats, such as integer and float.
If you’re aggregating data from two or more independent datasets, the gender field may have two different values for men: man and male. Likewise, if you’re aggregating data from ten different datasets, a field that’s present in eight of them may be missing from the other two.
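As a quick illustration, here’s a minimal sketch of reconciling such mismatched labels with pandas; the column name and values are made up:

```python
import pandas as pd

# Two hypothetical sources that encode the same category differently
df_a = pd.DataFrame({"gender": ["man", "woman", "man"]})
df_b = pd.DataFrame({"gender": ["male", "female", "female"]})

combined = pd.concat([df_a, df_b], ignore_index=True)

# Map the inconsistent labels onto a single convention
combined["gender"] = combined["gender"].replace({"man": "male", "woman": "female"})
print(combined["gender"].unique())  # ['male' 'female']
```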
By preprocessing data, we make it easier to interpret and use. This process eliminates inconsistencies or duplicates in data, which can otherwise negatively affect a model’s accuracy. Data preprocessing also ensures that there aren’t any incorrect or missing values due to human error or bugs. In short, employing data preprocessing techniques makes the dataset more complete and accurate.
For machine learning algorithms, nothing is more important than quality training data. Their performance or accuracy depends on how relevant, representative, and comprehensive the data is.
Before looking at how data is preprocessed, let’s look at what can go wrong when data quality is poor.
For machine learning models, data is fodder.
An incomplete training set can lead to unintended consequences such as bias, giving an unfair advantage or disadvantage to a particular group of people. Incomplete or inconsistent data can negatively affect the outcome of data mining projects as well. Data preprocessing is used to resolve such problems.
There are four stages of data preprocessing: cleaning, integration, reduction, and transformation.
Data cleaning or cleansing is the process of cleaning datasets by accounting for missing values, removing outliers, correcting inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning models.
The techniques used in data cleaning are specific to the data scientist’s preferences and the problem they’re trying to solve. Here’s a quick look at the issues that are solved during data cleaning and the techniques involved.
The problem of missing data values is quite common. It may happen during data collection or due to some specific data validation rule. In such cases, you need to collect additional data samples or look for additional datasets.
The issue of missing values can also arise when you concatenate two or more datasets to form a bigger dataset. If not all fields are present in both datasets, it’s better to delete such fields before merging.
Here are some common ways to account for missing data: manually fill in the missing values, fill them with a measure of central tendency such as the attribute’s mean, median, or mode, or predict the most probable value using a method such as regression or a decision tree.
If 50 percent of the values in any row or column of the dataset are missing, it’s better to delete the entire row or column, unless it’s possible to fill in the values using any of the above methods.
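To make this concrete, here’s a minimal sketch with pandas; the columns and values are made up, and the 50 percent rule is applied via a simple non-null threshold:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_hours": [2.0, np.nan, 4.5, 3.0],
    "total_marks": [55, 70, np.nan, 62],
})

# Drop any column where more than half of the values are missing
df = df.dropna(axis=1, thresh=len(df) // 2 + 1)

# Impute the remaining gaps with a measure of central tendency (here, the median)
df = df.fillna(df.median(numeric_only=True))
print(df)
```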
A large amount of meaningless data is called noise. More precisely, it’s the random variance in a measured variable or data having incorrect attribute values. Noise includes duplicates or near-duplicates of data points, data segments of no value for a specific research process, or unwanted information fields.
For example, if you need to predict whether a person can drive, information about their hair color, height, or weight will be irrelevant.
An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re training an algorithm to detect tortoises in pictures. The image dataset may contain images of turtles wrongly labeled as tortoises. This can be considered noise.
However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That sample can be considered an outlier and not necessarily noise. This is because we want to teach the algorithm all possible ways to detect tortoises, and so, deviation from the group is essential.
For numeric values, you can use a scatter plot or box plot to identify outliers.
Common methods for dealing with noise include binning (smoothing values using their neighbors), regression (fitting the data to a function), and clustering or outlier analysis (grouping similar values and flagging the ones that don’t fit any group).
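Here’s a minimal sketch, assuming made-up speed readings, of binning by bin means and a simple interquartile range (IQR) outlier check:

```python
import pandas as pd

speeds = pd.Series([42, 44, 43, 45, 41, 44, 120, 43, 42, 46])  # 120 looks anomalous

# Smoothing by bin means: group values into equal-width bins and
# replace each value with the mean of its bin
bins = pd.cut(speeds, bins=3)
smoothed = speeds.groupby(bins).transform("mean")

# Flag outliers with the interquartile range (IQR) rule
q1, q3 = speeds.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = speeds[(speeds < q1 - 1.5 * iqr) | (speeds > q3 + 1.5 * iqr)]
print(outliers)  # flags the 120 reading
```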
Since data is collected from various sources, data integration is a crucial part of data preparation. Integration may lead to several inconsistent and redundant data points, ultimately leading to models with inferior accuracy.
Common approaches to data integration include data consolidation (physically combining data from different sources into a single store), data virtualization (providing a unified, real-time view across sources), and data propagation (copying data from one location to another).
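Here’s a minimal consolidation sketch with pandas; the table names, keys, and values are made up for illustration:

```python
import pandas as pd

# Hypothetical sources: student demographics and exam results keyed by student_id
demographics = pd.DataFrame({"student_id": [1, 2, 3], "city": ["Pune", "Oslo", "Lyon"]})
results = pd.DataFrame({"student_id": [1, 2, 4], "total_marks": [78, 65, 90]})

# An outer join keeps records that appear in only one source,
# so nothing is silently dropped during integration
merged = pd.merge(demographics, results, on="student_id", how="outer")
print(merged)
```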
As the name suggests, data reduction is used to reduce the amount of data and thereby reduce the costs associated with data mining or data analysis.
It offers a condensed representation of the dataset. Although this step reduces the volume, it maintains the integrity of the original data. Data reduction is especially crucial when working with big data, as the amount of data involved can be gigantic.
The following are some techniques used for data reduction.
Dimensionality reduction, also known as dimension reduction, reduces the number of features or input variables in a dataset.
The number of features or input variables of a dataset is called its dimensionality. The higher the number of features, the more troublesome it is to visualize the training dataset and create a predictive model.
In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality reduction algorithms can be used to reduce the number of random variables and obtain a set of principal variables.
There are two main approaches to dimensionality reduction: feature selection and feature extraction.
In feature selection, we try to find a subset of the original set of features, giving us a smaller set that can still be used to model and visualize the problem. Feature extraction, on the other hand, maps the data from a high-dimensional space to a lower-dimensional space, in other words, a space with fewer dimensions.
One of the most common ways to perform dimensionality reduction is principal component analysis (PCA), which projects the data onto a smaller set of uncorrelated components.
Other dimensionality reduction techniques include factor analysis, independent component analysis, and linear discriminant analysis (LDA).
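Here’s a minimal PCA sketch with scikit-learn on synthetic data; retaining 95 percent of the variance is just an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                          # 100 samples, 10 features
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(100, 5))   # make half the features redundant

# Keep only as many components as needed to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```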
Feature subset selection is the process of selecting a subset of features or attributes that contribute the most or are the most important.
Suppose you’re trying to predict whether a student will pass or fail by looking at historical data of similar students. You have a dataset with four features: roll number, total marks, study hours, and extracurricular activities.
In this case, roll numbers do not affect students’ performance and can be eliminated. The new subset will have just three features and will be more efficient than the original set.
This data reduction approach can help create faster and more cost-efficient machine learning models. Attribute subset selection can also be performed in the data transformation step.
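Continuing the student example, here’s a minimal sketch of dropping the uninformative feature; the column names and values are hypothetical:

```python
import pandas as pd

students = pd.DataFrame({
    "roll_number": [101, 102, 103, 104],
    "total_marks": [78, 45, 62, 90],
    "study_hours": [5, 1, 3, 6],
    "extracurricular": [1, 0, 1, 1],
    "passed": [1, 0, 1, 1],
})

# Roll number is just an identifier with no predictive value, so drop it
features = students.drop(columns=["roll_number", "passed"])
target = students["passed"]
print(features.columns.tolist())  # ['total_marks', 'study_hours', 'extracurricular']
```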
Numerosity reduction is the process of replacing the original data with a smaller form of data representation. There are two ways to perform this: parametric and non-parametric methods.
Parametric methods use models for data representation. Log-linear and regression methods are used to create such models. In contrast, non-parametric methods store reduced data representations using clustering, histograms, data cube aggregation, and data sampling.
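Here’s a small non-parametric sketch: representing a large numeric column with a random sample and a histogram summary instead of every raw value (the numbers are synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"speed_kmh": rng.normal(60, 10, size=100_000)})

# Keep a 1% random sample instead of the full dataset
sample = df.sample(frac=0.01, random_state=42)

# Summarize the distribution with 20 histogram bins (counts per bin)
counts, edges = np.histogram(df["speed_kmh"], bins=20)

print(len(df), "->", len(sample), "rows;", len(counts), "histogram bins")
```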
Data transformation is the process of converting data from one format to another. In essence, it involves methods for transforming data into appropriate formats that the computer can learn efficiently from.
For example, speed can be recorded in miles per hour, meters per second, or kilometers per hour, so a dataset may store a car’s speed in different units. Before feeding this data to an algorithm, we need to transform the values into a single unit.
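A minimal sketch of that conversion with pandas, using made-up readings and a mapping to kilometers per hour:

```python
import pandas as pd

# Hypothetical speed readings recorded in mixed units
df = pd.DataFrame({
    "speed": [60, 26.8, 100],
    "unit": ["mph", "m/s", "km/h"],
})

# Convert everything to a single unit (km/h) before feeding it to an algorithm
to_kmh = {"mph": 1.60934, "m/s": 3.6, "km/h": 1.0}
df["speed_kmh"] = df["speed"] * df["unit"].map(to_kmh)
print(df)
```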
The following are some strategies for data transformation.
Smoothing is a statistical approach for removing noise from the data with the help of algorithms. It helps highlight the most valuable features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to make the patterns more visible.
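One simple way to do this is a centered moving average; here’s a sketch over made-up sensor readings:

```python
import pandas as pd

# Hypothetical noisy readings with one short-lived spike
readings = pd.Series([10, 12, 11, 35, 12, 13, 11, 12])

# A centered moving average dampens the spike while preserving the overall level
smoothed = readings.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed.round(1).tolist())
```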
Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or analysis. Aggregating data from various sources to increase the number of data points is essential, as only then will the ML model have enough examples to learn from.
Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient to place people in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age values.
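Here’s what that can look like with pandas; the age boundaries are one possible choice, not a fixed standard:

```python
import pandas as pd

ages = pd.Series([14, 19, 23, 37, 45, 61, 70])

# Bucket continuous ages into ordered categories
labels = ["teen", "young adult", "middle age", "senior"]
age_groups = pd.cut(ages, bins=[0, 18, 30, 55, 120], labels=labels)
print(age_groups.tolist())
```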
Generalization involves converting low-level data features into high-level data features. For instance, categorical attributes such as home address can be generalized to higher-level definitions such as city or state.
Normalization refers to the process of converting all data variables into a specific range. In other words, it’s used to scale the values of an attribute so that they fall within a smaller range, for example, 0 to 1. Decimal scaling, min-max normalization, and z-score normalization are some methods of data normalization.
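As a minimal sketch, both rescalings can be written directly with pandas (the marks column is made up):

```python
import pandas as pd

marks = pd.DataFrame({"total_marks": [45, 62, 78, 90]})

# Min-max normalization: rescale values into the 0-1 range
min_max = (marks - marks.min()) / (marks.max() - marks.min())

# Z-score normalization: center on the mean, scale by the standard deviation
z_score = (marks - marks.mean()) / marks.std()

print(min_max["total_marks"].round(2).tolist())
print(z_score["total_marks"].round(2).tolist())
```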
Feature construction involves constructing new features from the given set of features. This method simplifies the original dataset and makes it easier to analyze, mine, or visualize the data.
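For example, here’s a sketch with a hypothetical trip dataset, where an average-speed feature is derived from distance and duration:

```python
import pandas as pd

trips = pd.DataFrame({
    "distance_km": [12.0, 30.0, 8.5],
    "duration_h": [0.25, 0.5, 0.2],
})

# Construct a new, more informative feature from the existing ones
trips["avg_speed_kmh"] = trips["distance_km"] / trips["duration_h"]
print(trips)
```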
Concept hierarchy generation lets you create a hierarchy between features even when one isn’t explicitly specified. For example, if you have a house address dataset containing data about the street, city, state, and country, this method can be used to organize the data in hierarchical forms.
Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or inconsistent data easily influences ML models. The key is to feed them high-quality, accurate data, for which data preprocessing is an essential step.