Cross-validation is an invaluable tool for data scientists.
It's useful for building more accurate machine learning models and evaluating how well they work on an independent test dataset.
Cross-validation is easy to understand and implement, making it a go-to method for comparing the predictive capabilities (or skills) of different models and choosing the best. It's beneficial when the amount of data available is limited and is a great way to check how a predictive model works in practice.
What is cross-validation?
Cross-validation (CV) is a technique used to assess a machine learning model and test its performance (or accuracy). It involves reserving a specific sample of a dataset on which the model isn't trained. Later on, the model is tested on this sample to evaluate it.
Cross-validation is used to detect overfitting and guard against it, especially if the amount of data available is limited. It's also known as rotation estimation or out-of-sample testing and is mainly used in settings where the model's target is prediction.
Did you know? A model is considered "overfitted" if it models the training data so well that it negatively affects its performance on new data.
This resampling procedure is also used to compare different machine learning models and determine how well they work to solve a particular problem. In other words, cross-validation is a method used to assess the skill of machine learning models.
Simply put, in the process of cross-validation, the original data sample is randomly divided into several subsets. The machine learning model trains on all subsets, except one. After training, the model is tested by making predictions on the remaining subset.
In many instances, multiple rounds of cross-validation are performed using different subsets, and their results are averaged to determine which model is a good predictor.
Why is cross-validation important?
Cross-validation is crucial where the amount of data available is limited.
Suppose you need to predict the likelihood of a bicycle tire getting punctured. For this, you have collected data on the existing tires: the age of the tire, the number of miles endured, the weight of the rider, and whether it was punctured before.
To create a predictive model, you'll use this (historical) data. There are two things you need to do with this data – train the algorithm and test the model.
Did you know? In machine learning, an algorithm and a model aren't the same. A model is what is learned by the machine learning algorithm.
Since you only have a limited amount of data available, it would be naive to use all of it to train the algorithm. If you did, you wouldn't have any data left to test or evaluate the model.
Reusing the training set as the test set isn't a great idea either, as we need to evaluate the model's accuracy on data that it wasn't trained on. This is because the main objective of training is to prepare the model to work on real-world data, and it's improbable that your training data set contains all possible data points that the model will ever encounter.
A better idea would be to use the first 75 percent (three blocks) of the data as the training data set and the last 25 percent (one block) as the testing data set. This will allow you to compare how well different algorithms categorized the test data.
But of course, how would you know that using the first 75 percent of the data as the training set and the remaining 25 percent as the test set is the best way?
Instead, you can use the first 25 percent of the data for testing; or, you can use the third block of the data as the test data set and the remaining as the training dataset.
To avoid relying on any single split, a type of cross-validation called k-fold cross-validation uses all (four) parts of the data set as test data, one at a time, and then summarizes the results.
For example, cross-validation will use the first three blocks of the data to train the algorithm and use the last block to test the model. It then records how well the model performed with the test data.
After recording the performance or accuracy, it’ll use the 1st, 2nd, and 4th blocks of the data to train and the 3rd block to test. The process continues until all blocks are used once as test data. The average of all results is calculated to evaluate the model's performance.
In the above example, the data was divided into four blocks. Hence, this cross-validation is called 4-fold cross-validation. If it were divided into ten blocks, it would be 10-fold cross-validation.
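The block rotation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name block_splits and the eight-record dataset are hypothetical, and the blocks are taken in order from first to last, whereas the walkthrough above starts with the last block. The order doesn't affect the averaged result.

```python
def block_splits(n_samples, n_blocks):
    """Yield (train_indices, test_indices), using each contiguous block once as the test set."""
    # Spread any remainder over the first blocks so block sizes differ by at most one.
    base, extra = divmod(n_samples, n_blocks)
    start = 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Eight hypothetical tire records split into four blocks of two records each.
for round_no, (train, test) in enumerate(block_splits(8, 4), start=1):
    print(f"round {round_no}: train={train} test={test}")
```

Every record lands in the test block exactly once and in the training blocks three times, which is what makes this 4-fold cross-validation.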
In short, cross-validation is useful for model selection and makes it effortless to examine how well a model generalizes to new data.
In other words, it's helpful to determine the prediction error of a model. It's also used to compare the performance or accuracy of different machine learning methods like support vector machine (SVM), K-nearest neighbor (KNN), linear regression, or logistic regression.
Here are some more reasons why data scientists love cross-validation:
- Lets them use all of the data without sacrificing any subset (not valid for the holdout method)
- Reveals the consistency of the data and the algorithm
- Helps avoid overfitting and underfitting
Cross-validation is also used to tune the hyperparameters of a machine learning model through techniques such as grid search and randomized search cross-validation.
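As a rough illustration of hyperparameter tuning with cross-validation, the sketch below tries every value in a small grid and keeps the one with the best cross-validated accuracy. Everything in it is an assumption made for the example: the toy one-dimensional dataset, the tiny k-nearest-neighbor classifier, the helper names knn_predict and cv_accuracy, and the unshuffled interleaved folds. A randomized search would sample candidate values instead of trying all of them.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x (1-D inputs)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:n_neighbors]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, n_neighbors, k=4):
    """Accuracy pooled over k folds; fold f holds samples f, f+k, f+2k, ... (no shuffling)."""
    hits = 0
    for f in range(k):
        train_idx = [i for i in range(len(X)) if i % k != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in range(f, len(X), k):
            hits += knn_predict(train_X, train_y, X[i], n_neighbors) == y[i]
    return hits / len(X)

# Hypothetical toy data: one feature, two classes, with a little overlap.
X = [0.5, 1.0, 1.5, 2.0, 2.2, 2.8, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

grid = [1, 3, 5]  # candidate values for the hyperparameter n_neighbors
scores = {n: cv_accuracy(X, y, n) for n in grid}
best = max(scores, key=scores.get)
print(scores, "best n_neighbors:", best)
```

Because the folds here are equal in size, the pooled accuracy equals the average of the per-fold accuracies.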
Types of cross-validation
Cross-validation methods can be broadly classified into two categories: exhaustive and non-exhaustive methods.
As the name suggests, exhaustive cross-validation methods strive to test on all possible ways to divide the original data sample into a training and a testing set. On the other hand, non-exhaustive methods don't compute all ways of partitioning the original data into training and evaluation sets.
Below are the five common types of cross-validation.
1. Holdout method
The holdout method is one of the basic cross-validation approaches in which the original dataset is divided into two parts – training data and testing data. It's a non-exhaustive method, and as expected, the model is trained on the training dataset and evaluated on the testing dataset.
In most cases, the training dataset is at least twice as large as the test dataset; typical splits are 80:20 or 70:30. Also, the data is randomly shuffled before dividing it into training and testing sets.
However, there are some downsides to this cross-validation method. Since the model is trained on a different combination of data points, it can exhibit varying results every time it's trained. Additionally, we can never be entirely sure that the training dataset chosen represents the entire dataset.
If the original data sample isn't too large, there's also a chance that the test data may contain some crucial information, which the model will fail to recognize as it's not included in the training data.
However, the holdout cross-validation technique is ideal if you're in a hurry to train and test a model and have a large dataset.
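A minimal sketch of the shuffle-and-split step of the holdout method, using only Python's standard library. The function name holdout_split, the seed parameter, and the ten-record dataset are assumptions made for illustration.

```python
import random

def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-ins for ten data points
train, test = holdout_split(records, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

Changing the seed changes which points end up in the test set, which is the source of the run-to-run variation described above.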
2. K-fold cross-validation
The k-fold cross-validation method is an improved version of the holdout method. It brings more consistency to the model's score as it doesn't depend on how we choose the training and testing dataset.
It's a non-exhaustive cross-validation method, and as the name suggests, the dataset is divided into k number of splits, and the holdout method is performed k times.
For example, if the value of k is equal to two, there will be two subsets of equal sizes. In the first iteration, the model is trained on one subsample and validated on the other. In the second iteration, the model is trained on the subset that was used to validate in the previous iteration and tested on the other subset. This approach is called 2-fold cross-validation.
Similarly, if the value of k is equal to five, the approach is called 5-fold cross-validation and involves five subsets and five iterations. The value of k is yours to choose. It's commonly set to 10, which is a reasonable default if you're unsure.
The k-fold cross-validation procedure starts with randomly splitting the original dataset into k folds or subsets. In each iteration, the model is trained on k-1 of the subsets. After that, the model is tested on the one remaining subset to check its performance.
This process is repeated until each of the k folds has served as the evaluation set. The results of all iterations are averaged, and the average is called the cross-validation accuracy. Cross-validation accuracy is used as a performance metric to compare different models.
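The procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names, the toy dataset, and the "always predict the most common label" stand-in model are all assumptions. Any model can be plugged in through the fit and predict callables.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=5, seed=0):
    """Train on k-1 folds, test on the remaining one, and average the k accuracies."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        hits = sum(predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k, scores

# Toy stand-in model: always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[i] for i in range(20)]
y = [0] * 14 + [1] * 6
mean_acc, fold_scores = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5)
print(mean_acc, fold_scores)
```

The individual fold scores depend on how the shuffle distributes the two classes, but their average here is the share of the majority class (14 out of 20), since the stand-in model always predicts class 0.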
The k-fold cross-validation technique generally gives a less biased estimate of a model's performance, as every data point from the original dataset is used for testing exactly once and for training k-1 times. This method is a good choice if you have a limited amount of data.
However, as expected, this process might be time-consuming because the algorithm has to rerun k times from scratch. This also means that it takes roughly k times as much computation as the holdout method.
3. Stratified k-fold cross-validation
Since we're randomly shuffling data and splitting it into folds in k-fold cross-validation, there's a chance that we end up with imbalanced subsets. This can cause the training to be biased, which results in an inaccurate model.
For example, consider the case of a binary classification problem in which each of the two types of class labels comprises 50 percent of the original data. This means that the two classes are present in the original sample in equal proportions. For the sake of simplicity, let's name the two classes A and B.
While shuffling data and splitting it into folds, there's a chance that we end up with a fold in which the majority of data points are from class A and only a few from class B. Such a subset is seen as an imbalanced subset and can lead to creating an inaccurate classifier.
To avoid such situations, the folds are stratified using a process called stratification. In stratification, the data is rearranged to ensure that each subset is a good representation of the entire dataset.
In the above example of binary classification, this would mean it's better to divide the original sample so that half of the data points in a fold are from class A and the rest from class B.
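One simple way to stratify, sketched below, is to group the sample indices by class and deal each class's samples across the folds in turn. The function name stratified_folds and the 20-sample dataset are assumptions made for illustration. When class sizes aren't divisible by k, this simplified version makes the first folds slightly larger.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold roughly mirrors the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal this class's samples round-robin across the folds.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

labels = ["A"] * 10 + ["B"] * 10  # two classes in equal proportions
for fold in stratified_folds(labels, k=5):
    counts = {c: sum(labels[i] == c for i in fold) for c in ("A", "B")}
    print(counts)  # each fold: {'A': 2, 'B': 2}
```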
4. Leave-p-out cross-validation
Leave-p-out cross-validation (LpOCV) is an exhaustive method in which p number of data points are taken out from the total number of data samples represented by n.
The model is trained on n-p data points and later tested on p data points. The same process is repeated for all possible combinations of p from the original sample. Finally, the results of each iteration are averaged to attain the cross-validation accuracy.
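To get a feel for how many iterations "all possible combinations" means, the splits can be enumerated with Python's itertools. The function name leave_p_out_splits is an assumption made for this sketch. The number of splits is the binomial coefficient "n choose p".

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield (train_indices, test_indices) for every way of holding out p samples."""
    all_indices = range(n_samples)
    for test in combinations(all_indices, p):
        held_out = set(test)
        train = [i for i in all_indices if i not in held_out]
        yield train, list(test)

splits = list(leave_p_out_splits(5, 2))
print(len(splits), comb(5, 2))  # 10 10

# The count grows quickly with n and p:
print(comb(100, 2))  # 4950
print(comb(100, 5))  # 75287520
```

With 100 samples, leaving out five at a time already requires more than 75 million model fits, which is why this method is rarely practical beyond small datasets or small values of p.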
5. Leave-one-out cross-validation
The leave-one-out cross-validation (LOOCV) approach is a simplified version of LpOCV in which the value of p is set to one. It's still an exhaustive method, but it needs only n train-and-test rounds instead of one round for every combination of p points. Even so, executing this method can be expensive and time-consuming as the model has to be fitted n times.
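A small sketch of LOOCV with a one-nearest-neighbor classifier on one-dimensional data. The classifier, the helper names, and the six-point dataset are assumptions chosen to keep the example short.

```python
def nearest_neighbor_label(train_points, train_labels, x):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    best = min(range(len(train_points)), key=lambda i: abs(train_points[i] - x))
    return train_labels[best]

def loocv_accuracy(points, labels):
    """Hold out each sample once, predict it from the other n-1, and average the hits."""
    hits = 0
    for i in range(len(points)):
        train_points = points[:i] + points[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        hits += nearest_neighbor_label(train_points, train_labels, points[i]) == labels[i]
    return hits / len(points)

points = [1.0, 1.2, 1.4, 5.0, 5.3, 5.6]
labels = ["low", "low", "low", "high", "high", "high"]
print(loocv_accuracy(points, labels))  # 1.0
```

The loop runs once per data point, so six points mean six rounds of training and testing.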
There are other cross-validation techniques, including repeated random subsampling validation, nested cross-validation, and time-series cross-validation.
Applications of cross-validation
The primary application of cross-validation is to evaluate the performance of machine learning models. This helps compare machine learning methods and determine which is ideal for solving a specific problem.
For example, suppose you're considering k-nearest neighbors (KNN) or a support vector machine (SVM) to perform optical character recognition. In this case, you can use cross-validation to compare the two based on the number of characters misclassified by each method.
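The comparison can be sketched by running both candidates through the same folds and counting the test-set errors. The two classifiers below (one-nearest-neighbor and a nearest-class-mean rule) are simple stand-ins chosen for brevity, not character recognition systems, and the toy data, helper names, and unshuffled interleaved folds are likewise assumptions.

```python
def one_nn(train_X, train_y, x):
    """Label of the single closest training point."""
    closest = min(range(len(train_X)), key=lambda j: abs(train_X[j] - x))
    return train_y[closest]

def nearest_mean(train_X, train_y, x):
    """Label of the class whose training mean is closest to x."""
    means = {}
    for label in set(train_y):
        values = [v for v, l in zip(train_X, train_y) if l == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(means[label] - x))

def cv_misclassified(predict, X, y, k=5):
    """Total test-set errors over k folds; fold f holds samples f, f+k, f+2k, ..."""
    errors = 0
    for f in range(k):
        train = [i for i in range(len(X)) if i % k != f]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in range(f, len(X), k):
            errors += predict(train_X, train_y, X[i]) != y[i]
    return errors

# Hypothetical toy data: one feature, two classes, with a little overlap.
X = [0.2, 0.9, 1.1, 1.8, 2.4, 2.6, 3.1, 3.3, 3.9, 4.4]
y = ["a", "a", "a", "a", "b", "a", "b", "b", "b", "b"]

results = {
    "1-NN": cv_misclassified(one_nn, X, y),
    "nearest mean": cv_misclassified(nearest_mean, X, y),
}
print(results)  # the method with fewer misclassified points is preferred
```

Using identical folds for both methods keeps the comparison fair, since neither gets an easier test set than the other.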
Cross-validation can also be used in feature selection to select features that contribute the most to the prediction output.
Limitations of cross-validation
The primary challenge of cross-validation is the need for substantial computational resources, especially in methods such as k-fold CV. Since the algorithm has to be rerun from scratch k times, it requires roughly k times as much computation to evaluate.
Another limitation is the one that surrounds unseen data. In cross-validation, the test dataset is the unseen dataset used to evaluate the model's performance. In theory, this is a great way to check how the model works when used for real-world applications.
But, there can never be a comprehensive set of unseen data in practice, and one can never predict the kind of data that the model might encounter in the future.
Suppose a model is built to predict an individual's risk of contracting a specific infectious disease. If the model is trained on data from a research study involving only a particular population group (for example, women in their mid-20s), its predictive performance on the general population might differ dramatically from the cross-validation accuracy.
Furthermore, cross-validation will produce meaningful results only if human biases are controlled in the original sample set.
Cross-validation to the rescue
Cross-validated model building is an excellent method to create machine learning applications with greater accuracy or performance. Cross-validation techniques like k-fold cross-validation make it possible to estimate a model's performance without sacrificing the test split.
They also reduce the issues that an imbalanced data split causes; in short, they can enable data scientists to rely less on luck and more on iterations.
There's a subset of machine learning that tries to mimic the functioning of the human brain. It's called deep learning, and artificial general intelligence, if ever possible, would require its decision-making abilities.