Nagesh Singh Chauhan
Cross-Validation Techniques in Machine Learning
The article dives deep into various cross-validation techniques in machine learning.
A machine learning model should predict the correct output consistently across a variety of input values drawn from diverse datasets. This property of a model is called stability. If a model's performance does not vary much when the input data is altered, it has been trained well to generalize and find patterns in our data. A model can lose stability in two ways:
Underfitting: This arises when the model does not fit the training data properly. It fails to discover the patterns in the data, so when it is given new data to predict, it cannot find patterns there either. It underperforms on both seen and unseen data.
Overfitting: This arises when the model fits the training data very closely but fails to perform on new, unseen data. It picks up slight variations (noise) in the training data and cannot cope with data that does not have the same variations.
The figures below show underfit, optimally fit, and overfit models:
In Figure 1, the model does not capture all the features of our data and misses some important data points. This model has generalized our data too much and is underfitted. In Figure 2, the model captures the intricacies of our data while ignoring the noise; this is our optimal model. In Figure 3, the model has captured every single characteristic of the data, including the noise. If we were to give it a different dataset, it would not predict well because it is too specific to our training data; hence it is overfitted.
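The three regimes can be illustrated with a small numerical experiment. The sketch below is only an illustration under assumed settings: the noisy sine target, the sample sizes, and the polynomial degrees (1, 4, 15) are arbitrary choices, not taken from the figures. It fits polynomials of increasing degree and reports the error on the training points and on fresh points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def target(x):
    # Assumed "true" pattern; the noise added below plays the role of noise in real data.
    return np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 30))
x_test = np.sort(rng.uniform(0, 1, 30))
y_train = target(x_train) + rng.normal(0, 0.2, 30)
y_test = target(x_test) + rng.normal(0, 0.2, 30)

results = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit on the training points only.
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training error can only go down as the degree grows, while the error on fresh points typically stops improving and may get worse. That gap between the two errors is the usual symptom of overfitting.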
What is Cross-Validation?
Cross-validation (CV) is a statistical method for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
CV is easy to understand and easy to implement, and it tends to have a lower bias than other methods used to estimate a model's performance scores. All of this makes cross-validation a powerful tool for selecting the best model for a specific task.
Many different techniques can be used to cross-validate a model, but most of them follow a similar algorithm:
Shuffle the dataset to remove any kind of order.
Split the data into K folds. K = 5 or K = 10 works for most cases.
Keep one fold for testing and use all the remaining folds for training.
Train (fit) the model on the training set, test (evaluate) it on the test set, and note down the result for that split.
Repeat this process for all the folds, each time choosing a different fold as the test data.
This way, in every iteration the model is trained and tested on different sets of data.
At the end, average the scores from all the splits to get the mean score.
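The steps above can be sketched in plain Python. This is a minimal sketch rather than a reference implementation: the helper names (`k_fold_indices`, `cross_validate`) and the toy "model" (the mean of the training targets) are hypothetical and exist only to make the loop concrete.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # remove any ordering
    return [indices[i::k] for i in range(k)]   # k disjoint folds

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Run k-fold CV; `fit` and `score` are caller-supplied (hypothetical) callables."""
    scores = []
    for test_fold in k_fold_indices(len(xs), k, seed):
        held_out = set(test_fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the k-1 training folds, evaluate on the single held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_fold], [ys[i] for i in test_fold]))
    return mean(scores), scores                # average of the per-fold scores

# Toy usage: the "model" is just the mean training target,
# scored by negative mean absolute error (higher is better).
def fit_mean(train_x, train_y):
    return mean(train_y)

def neg_mae(model, test_x, test_y):
    return -mean(abs(y - model) for y in test_y)

xs = list(range(20))
ys = [2.0 * x for x in xs]
avg_score, fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=neg_mae)
print(len(fold_scores), avg_score)
```

Each sample lands in exactly one test fold, so every data point is used for evaluation once and for training k-1 times.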
As you may know, there are plenty of CV techniques. Some of them are commonly used, while others are mostly of theoretical interest. Let's look at the cross-validation methods covered in this article.
Hold-out (Train-Test Split)
Monte Carlo (Shuffle-Split)
Time series (Rolling cross-validation)
The holdout method is a non-exhaustive cross-validation technique in which data points are randomly allocated to a training dataset and a test dataset. The available dataset is split once into training and test data, with the test data normally having fewer data points. Because the dataset is split just once and the model is built just once on the training set, this method runs quickly.
Although this method adds no computational overhead and is more reasonable than evaluating the model on its own training data, it still suffers from high variance. This is because it is uncertain which data points will end up in the validation set, and the outcome might be quite different for different splits.
Another drawback of this technique is that it is not suitable for an imbalanced dataset. Suppose we have an imbalanced dataset with class ‘0’ and class ‘1’, where 80% of the data belongs to class ‘0’ and the remaining 20% to class ‘1’. If we split the data with 80% in the training set and 20% in the test set, it may happen that all the class ‘0’ data ends up in the training set and all the class ‘1’ data in the test set. Our model will then perform poorly on the test data because it has never seen class ‘1’ before.
Also, a large portion of the data is withheld from training. In the case of a small dataset, the part kept aside for testing may have important characteristics that our model misses because it was not trained on that data.
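A hold-out split can be sketched with scikit-learn's `train_test_split`. The library choice and the synthetic 80/20 labels below are assumptions made for illustration. The `stratify` argument is one common way to avoid the class-imbalance problem described above, since it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 samples of class 0, 20 of class 1 (illustrative only).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y asks for the same class proportions in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of class 1 in each part.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, the split is purely random, and the class mix in the test set can drift away from the mix in the full dataset.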
In K-fold cross-validation, the input data is divided into ‘K’ folds, hence the name. Suppose we have divided the data into 5 folds, i.e. K=5. The model is then trained and tested 5 times, and in every iteration we use one fold as test data and all the rest as training data. Note that the data in the training and test folds changes with every iteration, which adds to the efficacy of this method.
This significantly reduces underfitting, as most of the data is used for training (fitting), and every data point is also used for validation exactly once. K-fold cross-validation helps us choose a model that generalizes, which results in better predictions on unseen data.
For most cases, 5 or 10 folds are adequate, but depending on the problem you can split the data into any number of folds.
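As a concrete sketch, K-fold CV with K=5 can be run with scikit-learn's `KFold` and `cross_val_score`. The iris dataset and the logistic regression model are arbitrary stand-ins chosen for the example and are not part of the original discussion.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# shuffle=True randomizes the order before the data is cut into 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; for classifiers the default score is accuracy.
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the result depends on the particular split.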
In general, it is usually better to use the k-fold technique than the hold-out technique. In a head-to-head comparison, k-fold gives a more stable and reliable result, since training and testing are conducted on several different parts of the dataset. Increasing the number of folds tests the model on more sub-datasets, which can make the overall score more robust, at the price of more computation.
K-fold has some drawbacks. Firstly, it should not be used as-is for imbalanced datasets: it may happen that a training set contains only samples of class “0” and none of class “1”, while the validation set contains samples of class “1”.
Secondly, it is not suitable for time series data, because the order of the samples matters, whereas in K-fold cross-validation the samples are typically shuffled before being split.
The leave-p-out cross-validation (LPOCV) technique leaves p data points out of the training data: if there are n data points in the original sample, then n-p samples are used to train the model and p points are used as the validation set. This is repeated for all combinations in which the original sample can be divided this way, and the error is then averaged over all trials to give the overall effectiveness.
This technique is exhaustive in the sense that it requires training and validating the model for all possible combinations, and for somewhat large p, it can become computationally infeasible.
Suppose we have 100 samples in the dataset. If we use p=10, then in each iteration 10 samples will be used as the validation set and the remaining 90 samples as the training set.
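To see why leave-p-out quickly becomes infeasible, it helps to count the splits: there is one for every way of choosing p validation points out of n, i.e. "n choose p". The short sketch below uses only the Python standard library. The tiny n=5, p=2 enumeration is an added illustration, not part of the original example.

```python
from itertools import combinations
from math import comb

# Number of distinct train/validation splits for n=100, p=10.
n, p = 100, 10
n_splits = comb(n, p)
print(n_splits)  # on the order of 10**13 model fits

# Enumerate the splits explicitly for a tiny dataset (n=5, p=2).
samples = list(range(5))
splits = [
    ([i for i in samples if i not in val], list(val))
    for val in combinations(samples, 2)
]
print(len(splits))
```

Even at one millisecond per fit, the n=100, p=10 case would take centuries, which is why leave-p-out is rarely used beyond very small p.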
Leave-one-out cross-validation (LOOCV) is an extreme case of k-fold CV in which k equals the number of samples n (equivalently, leave-p-out with p=1): one sample point is used as the validation set and the remaining n-1 samples are used as the training set.
Suppose we have 50 samples in the dataset. Then in each iteration one sample will be used as the validation set and the remaining 49 samples as the training set. The process is repeated until every sample of the dataset has been used as a validation point.
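The 50-sample example can be checked with scikit-learn's `LeaveOneOut` splitter. Using scikit-learn here is an assumption for illustration, and the data is a dummy array.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(50).reshape(-1, 1)   # 50 dummy samples
loo = LeaveOneOut()

# Record the (training size, validation size) of every split.
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in loo.split(X)]
print(len(sizes), sizes[0])
```

There are as many splits as samples, so the model is fitted n times. This is manageable for small datasets but expensive for large ones.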
Stratified K-folds cross-validation
Since we're randomly shuffling data and splitting it into folds in k-fold cross-validation, there's a chance that we end up with imbalanced subsets. This can bias the training, which results in an erroneous model.
For example, consider the case of a binary classification problem in which each of the two types of class labels contains 50% of the original data. This means that the two classes are present in the original sample in identical proportions. For the sake of clarity, let's name the two classes A and B.
While shuffling data and splitting it into folds, there's a chance that we end up with a fold in which the bulk of the data points are from class A and only a few are from class B. Such a subset is imbalanced and can lead to an erroneous model.
To avoid such cases, the folds are built using a process called stratification. In stratification, the data is rearranged to ensure that each subset is a good representation of the entire dataset.
In the above example of binary classification, this would mean it's better to divide the original sample so that half of the data points in a fold are from class A and the rest from class B.
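The A/B example can be reproduced with scikit-learn's `StratifiedKFold`. The library and the 100-sample synthetic labels are assumptions made for the sketch. Each validation fold ends up with the same class mix as the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Balanced binary labels: 50 samples of class A and 50 of class B.
y = np.array(["A"] * 50 + ["B"] * 50)
X = np.zeros((100, 1))   # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for _, val_idx in skf.split(X, y):
    # Count how many samples of each class landed in this validation fold.
    labels, counts = np.unique(y[val_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))

print(fold_counts)
```

Note that the labels `y` must be passed to `split`, since the splitter needs them to balance the folds.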
Repeated K-folds cross-validation
The results from k-fold can be noisy, as a slightly different result is obtained each time the code is run. This is because the data set is split into the k folds differently each time. The model accuracy can change between executions, and it can be challenging to decide which result should be trusted.
One way to address this noise is to run k-fold a number of times and calculate the performance across all the repetitions. This technique is called repeated k-fold cross-validation. It comes at a computational cost and is therefore best suited to datasets of a smaller scale. In most cases, data sets of up to 1M records are feasible, and depending on the infrastructure and memory, it can scale to many times that and still be reasonably quick to run.
Figure: Repeated K-Fold Cross-Validation. The 10-fold CV works by dividing the training data into 10 equal parts. These parts are iterated 10 times. During each iteration, 9 of the 10 parts are treated as training data and the remaining 10th part as the validation set. The performance metrics are measured after each iteration. In the end, accuracy and kappa values are computed as measures of model performance. The above procedure is repeated 10 times.
How many repeats should be performed?
It depends on how noisy the data is. Just as 10 is a common default value for k, 10 is a common default for the number of repetitions, but it can be adjusted to, say, 5. Some investigation is required to decide on an appropriate value.
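A 10-fold CV repeated 10 times can be sketched with scikit-learn's `RepeatedKFold`. The dataset and model below are illustrative stand-ins, and accuracy is used as the only metric for simplicity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 folds, repeated 10 times with a different shuffle each time.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# 10 x 10 = 100 individual scores, summarized by their mean and spread.
print(len(scores), scores.mean(), scores.std())
```

Lowering `n_repeats` trades a noisier estimate for proportionally less computation.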
Nested K-folds cross-validation
Nested cross-validation (nested CV) nests hyperparameter tuning inside cross-validation. It is used to evaluate the performance of a machine learning algorithm while also estimating the generalization error of the underlying model together with its hyperparameter search.
The inner loop performs cross-validation to determine the best features and model hyperparameters using the k-1 data folds available at each iteration of the outer loop. The model is then trained once for each outer loop step and evaluated on the held-out data fold. This process produces k evaluations of the model performance, one for each data fold, so that every sample is used for testing.
Note that this technique is computationally expensive because many models are trained and evaluated. scikit-learn has no single dedicated class that performs nested k-fold CV, but the procedure can be composed from its existing tools.
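Although there is no single dedicated nested-CV class, one common way to assemble the procedure from standard scikit-learn pieces is to pass a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). The sketch below assumes an SVM classifier, a small grid over `C`, and the iris dataset, all chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: picks hyperparameters using only the outer training folds.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: estimates how well "model + hyperparameter search" generalizes.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)

# Each outer fold refits the whole search on its training part,
# then scores the selected model on the held-out part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean())
```

Here the model is fitted 5 × (3 × 3 + 1) = 50 times (three candidates over three inner folds plus one refit, for each of five outer folds), which shows how quickly the cost grows.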
Monte Carlo cross-validation
Monte Carlo operates rather differently. You randomly select (without replacement) some fraction of your data to form the training set, and then assign the rest of the points to the test set. This process is then repeated multiple times, generating (at random) new training and test partitions each time.
For example, suppose you chose to use 10% of your data as test data. Then your test set on rep #1 might be points 64, 90, 63, 42, 65, 49, 10, 31, 96, and 48. On the next run, your test set might be 90, 60, 23, 67, 16, 78, 42, 17, 73, and 26. Since the partitions are done independently for each run, the same point can appear in the test set of several runs (here, points 90 and 42), while other points may never be tested at all. This is the major difference between Monte Carlo and k-fold cross-validation.
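Monte Carlo cross-validation corresponds to scikit-learn's `ShuffleSplit`. The sketch below assumes 100 dummy points, a 10% test fraction, and 5 repetitions, and counts how often each point lands in a test set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)   # 100 dummy points

# Five independent random 90/10 partitions.
ss = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)
test_sets = [set(test_idx.tolist()) for _, test_idx in ss.split(X)]

# Count how many times each point was used for testing across the runs.
times_tested = np.zeros(100, dtype=int)
for test_set in test_sets:
    times_tested[list(test_set)] += 1

print("max times a point was tested:", times_tested.max())
print("points never tested:", int((times_tested == 0).sum()))
```

Unlike k-fold, nothing guarantees that every point is tested, so with few repetitions a large share of the data may never appear in any test set.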
Time series cross-validation
For time-series data, the techniques mentioned above are not the best way to evaluate models. Here are two reasons why:
Shuffling the data destroys its temporal structure, since it disrupts the order of events.
With ordinary cross-validation, there is a chance that we train the model on future data and test it on past data, which breaks the golden rule of time series: “peeking into the future is not allowed”.
There are mainly two ways to solve this: rolling cross-validation and blocked cross-validation.
In rolling cross-validation, we create the folds (or subsets) in a forward-chaining fashion. Suppose we have a time series of stock prices covering n years and we divide the data by year into n folds. The folds would be created like:
iteration 1: training [1], test [2]
iteration 2: training [1 2], test [3]
iteration 3: training [1 2 3], test [4]
iteration 4: training [1 2 3 4], test [5]
iteration 5: training [1 2 3 4 5], test [6]
. . .
iteration n-1: training [1 2 3 ... n-1], test [n]
As we can see, in the first iteration we train on the data of the first year and then test on the second year. Similarly, in the next iteration we train on the data of the first and second years and then test on the third year.
Note: It is not necessary to divide the data into years; I simply took this example to make it easier to understand.
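The forward-chaining scheme is simple enough to sketch in plain Python. The helper name `forward_chaining_splits` is hypothetical, and folds are numbered from 1 to match the yearly example.

```python
def forward_chaining_splits(n_folds):
    """Yield (training folds, test fold) pairs using 1-based fold numbers.

    Each iteration trains on every fold up to some point in time
    and tests on the fold that comes immediately after.
    """
    for last_train in range(1, n_folds):
        yield list(range(1, last_train + 1)), last_train + 1

splits = list(forward_chaining_splits(6))
for i, (train, test) in enumerate(splits, start=1):
    print(f"iteration {i}: training {train}, test [{test}]")
```

With n folds this gives n-1 iterations, and the test fold always lies strictly after all training folds. scikit-learn's `TimeSeriesSplit` provides a similar expanding-window splitter that works on sample indices.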
Rolling cross-validation may introduce leakage from future data into the model: the model can observe future patterns and try to remember them when forecasting. That's why blocked cross-validation was introduced.
It works by adding margins at two positions. The first is between the training and validation folds, to prevent the model from observing lag values that are used twice, once as a regressor and once as a response. The second is between the folds used at each iteration, to prevent the model from memorizing patterns from one iteration to the next.
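One possible reading of this scheme is sketched below in plain Python. The function name `blocked_splits`, the 70% training fraction, and the margin width `gap` are assumptions made for the sketch, since the description above does not fix any of them.

```python
def blocked_splits(n_samples, n_blocks, train_frac=0.7, gap=2):
    """Yield (train_idx, val_idx) for consecutive, non-overlapping blocks.

    `gap` samples are dropped between the training and validation parts
    of each block, and another `gap` samples at the end of each block,
    so that consecutive blocks are also separated by a margin.
    """
    block_size = n_samples // n_blocks
    for b in range(n_blocks):
        start = b * block_size
        stop = start + block_size - gap            # margin before the next block
        n_train = int(train_frac * (stop - start - gap))
        train_idx = list(range(start, start + n_train))
        val_idx = list(range(start + n_train + gap, stop))   # margin after training
        yield train_idx, val_idx

splits = list(blocked_splits(n_samples=100, n_blocks=5))
for train_idx, val_idx in splits:
    print(f"train {train_idx[0]}-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}")
```

Each block is used independently, so no sample from one iteration is reused in another, at the cost of training on less data per iteration than the rolling scheme.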
Limitations of cross-validation
The primary challenge of cross-validation is its substantial computational cost, particularly in methods such as k-fold CV. Since the model has to be retrained from scratch k times, it requires roughly k times as much computation as a single train-and-evaluate run.
Another limitation is the one that surrounds unseen data. In cross-validation, the test dataset is the unseen dataset used to evaluate the model's performance. In theory, this is a great way to check how the model works when used for real-world applications.
In practice, however, there can never be a complete set of unseen data, and one can never predict the kind of data that the model might confront in the future.
Suppose a model is built to predict an individual's risk of contracting a specific contagious disease. If the model is trained on data from a research study involving only a certain population (for example, men in their mid-30s), then when it is applied to the general population, its predictive performance might differ dramatically from the cross-validation accuracy.