The Ultimate Guide to Cross-Validation in Machine Learning

How do we know if our model works, and whether we have trained it well? We can find out by seeing how the model performs on previously unseen data, data that is completely new to it. We need to ensure that the model's accuracy stays consistent on such data. In other words, we need to validate our model.

Using cross-validation in machine learning, we can determine how our model performs on previously unseen data and test its accuracy.


Why Do Models Lose Stability?

Any machine learning model needs to consistently predict the correct output across different input values drawn from different datasets. This characteristic of a machine learning model is called stability. If a model's predictions do not change much when the input data is modified, it has been trained well to generalize and find patterns in our data. A model can lose stability in two ways:

  1. Underfitting: This occurs when the model does not fit the training data properly. It fails to find patterns in the training data, so it cannot find patterns in new data either. It under-performs on both known and unseen data.
  2. Overfitting: This occurs when the model fits the training data too closely, capturing every little variation, including noise. It performs well on the training data but fails on new, unseen data that does not have the same variations.

The figures below depict underfit, overfit, and optimally fit models:

Figure 1: Underfitting

Figure 2: Overfitting

Figure 3: Optimal Model

In Figure 1, the model does not quite capture all the features of our data and leaves out some important data points; it has generalized too much and is underfitted. In Figure 2, the model has captured every single aspect of the data, including the noise. Given a different dataset, it would fail to predict well because it is too specific to our training data; it is overfitted. In Figure 3, the model captures the intricacies of our data while ignoring the noise; this is our optimal model.

What is Cross-Validation?

When choosing a machine learning model, we need to compare candidates to see how they perform on our dataset and pick the best one for our data. However, data is usually limited: a dataset might not have enough data points, or it may contain missing or incorrect values. Moreover, with little data, training and testing on the same portion does not give us an accurate view of how our model performs. Training and evaluating a model on the same data means the model will eventually learn that data well but fail on new data; this is called overfitting. This is where cross-validation comes into the picture.

Cross-validation in machine learning is a technique used to train and evaluate a model on portions of the dataset, before re-partitioning the dataset and evaluating the model on the new portions.

This means that instead of splitting our dataset into two parts, one to train on and one to test on, we split the dataset into multiple portions, train on some of them, and test on the rest. We then use a different portion to train and test the model. This ensures that our model is trained and tested on new data at every step.

This also exposes our model to minority classes that may be present in the data. If we split the data only once and train on one part, there is a chance that the training data contains few or no instances of a minority class that appears in the test data. In this case, our model may still score well overall, since the class constitutes only a small portion of the dataset, but it will be desensitized to that class.

Consider the block below to represent the entirety of our data. We partition the dataset into training and testing data. The training data will be used by our model to learn. The testing data serves as unseen data and is used to evaluate our model's performance.

Figure 4: Partitioning the dataset for cross-validation

Figure 5: Training and testing with our partitioned dataset

We then choose a different portion to test on and use the remaining portions for training. The model's performance is re-evaluated on the newly partitioned dataset to get a better estimate of how it performs.

Figure 6: Training and testing on new portions

Steps in Cross-Validation

Step 1: Split the data into train and test sets and evaluate the model’s performance

The first step involves partitioning the dataset into training and testing sets and evaluating the model on them. The accuracy obtained on this first partition is noted.

Figure 7: Step 1 of cross-validation: partitioning the dataset
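To make this concrete, here is a minimal sketch of Step 1 in Python with scikit-learn, using a built-in toy dataset as a stand-in for our data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset for illustration
X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training portion and note the accuracy on the test portion
model = KNeighborsClassifier()
model.fit(X_train, y_train)
print(f"Accuracy on the first partition: {model.score(X_test, y_test):.3f}")
```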


Step 2: Split the data into new train and test sets and re-evaluate the model’s performance

After evaluating one portion of the dataset, we choose a different portion to test on and train on the rest. The output measure obtained from this new training and testing split is again noted.

Figure 8: Step 2 of cross-validation: re-evaluation on new portions

This step is repeated multiple times until the model has been trained and evaluated on the entire dataset.

Figure 9: Repeating Step 2 of cross-validation

Step 3: To get the actual performance metric, the average of all measures is taken

Figure 10: Step 3 of cross-validation: getting model performance
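A minimal sketch of Steps 2 and 3, again assuming scikit-learn and a toy dataset: KFold rotates which portion is held out, and the final metric is the average across folds:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 2: repeatedly choose a different portion to test on
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Step 3: average all measures to get the actual performance metric
print(f"Per-fold accuracy: {np.round(scores, 3)}")
print(f"Mean accuracy: {np.mean(scores):.3f}")
```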

Cross-Validation Techniques

There are various ways to perform cross-validation. Some of the commonly used techniques are:

  • K-fold cross-validation: In K-fold cross-validation, K refers to the number of portions the dataset is divided into, and it is selected based on the size of the dataset.

The dataset is split into K portions: one portion is used for testing and the rest for training.

Figure 11: K-Fold with k = 5

Another section will then be chosen for testing, with the remaining sections used for training. This continues K times, until every section has been used as the testing set once.

Figure 12: Selecting a different dataset portion for K-Fold CV

The final performance measure will be the average of the output measures of the K iterations.

Figure 13: Final accuracy using K-fold
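In scikit-learn, this whole K-fold procedure can be run in a few lines with cross_val_score; here is a sketch with k = 5 on a toy dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# k = 5: each of the five portions serves as the test set exactly once
scores = cross_val_score(
    KNeighborsClassifier(), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42)
)
print(scores)         # one accuracy value per fold
print(scores.mean())  # final performance measure
```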

  • Leave one out cross-validation (LOOCV): In LOOCV, instead of leaving out a portion of the dataset as testing data, we select a single data point as the test data. The rest of the dataset is used for training, and the single point is used for prediction after training.

Consider a dataset with N points: N-1 points form the training set and 1 point forms the testing set.

Figure 14: Splitting a dataset for LOOCV

Another point is then chosen as the testing data, with the rest of the points used for training.

This is repeated for the rest of the dataset, i.e., N times.

Figure 15: Selecting another point as testing data

The final performance measure will be the average of the measures for all N iterations.

Figure 16: Performance measure for LOOCV
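scikit-learn provides LeaveOneOut for this; a minimal sketch on the small iris dataset (LOOCV trains N models, so it is best kept to small datasets):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each of the N points is held out once as the test set
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())

print(len(scores))    # N iterations, one score per data point
print(scores.mean())  # final performance measure
```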

  • Stratified K-fold cross-validation: This method is useful when there are minority classes present in our data. In some cases, while partitioning the data, some testing sets will include instances of minority classes while others will not. When this happens, our accuracy will not properly reflect how well minority classes are being predicted. To overcome this, the data is split so that each portion has the same percentage of every class that exists in the dataset. Consider a dataset with 2 classes, as shown below.

Figure 17: Data with two classes present

In normal cross-validation, the data is divided without keeping in mind the distribution of individual classes. The model thus cannot properly learn to predict minority classes.

Figure 18: Division of data in cross-validation

Stratified K-fold overcomes this by maintaining the same percentage of each data class in all the folds, so the model can be trained even on minority classes.

Figure 19: Division of data in Stratified K-Fold cross-validation
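A short sketch of stratification with scikit-learn's StratifiedKFold, printing the class counts in each test fold to show that the class ratio is preserved:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# Stratification keeps the class proportions of y the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # roughly constant class counts per fold
```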


Cross-Validation With Python

Let's look at cross-validation using Python. We will use the adult income dataset to classify people based on whether their income is above $50K or not. We will build logistic regression and K Nearest Neighbours classifiers and, using cross-validation, see which one performs better.

Figure 20: Adult Census Data

Importing the libraries necessary for our model:

Figure 21: Importing libraries

We have imported the cross_val_score function along with the StratifiedKFold and KFold cross-validation classes from sklearn.model_selection.
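The screenshot itself isn't reproduced here; the sketch below reconstructs the imports under two stated assumptions: the linear classifier is scikit-learn's LogisticRegression (the standard linear model for a binary target), and the data sits in a hypothetical adult.csv file downloaded from the UCI Machine Learning Repository:

```python
import pandas as pd
from sklearn.model_selection import (
    train_test_split, cross_val_score, KFold, StratifiedKFold
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical file name; the adult census data is available from UCI
df = pd.read_csv("adult.csv")
```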

As we can see, in our prediction class, the income is stored as text. Let us convert it into numeric form to make classification easier.

Figure 22: Formatting prediction class
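A minimal sketch of that conversion, assuming the UCI column name income with values such as "<=50K" and ">50K":

```python
# Map the income strings to a binary class: 1 if above $50K, else 0
# (column and value names assumed from the UCI adult dataset)
df["income"] = (df["income"].str.strip() == ">50K").astype(int)
```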

Let us do the same with the sex column. At the same time, the detailed relationship and marital status values can be simplified into married or unmarried and then converted into binary classes.

Figure 23: Formatting columns
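A sketch of those conversions, again assuming UCI value names; the married/unmarried grouping is one reasonable choice, not necessarily the exact one in the screenshot:

```python
# Encode sex as binary (assumed UCI values "Male"/"Female")
df["sex"] = (df["sex"].str.strip() == "Male").astype(int)

# Collapse the detailed marital-status values into married (1) / unmarried (0)
married = ["Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse"]
df["marital-status"] = df["marital-status"].str.strip().isin(married).astype(int)
```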

After dropping unnecessary columns, the dataset is significantly reduced.

Figure 24: Dropping columns and the final dataset
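For example, the remaining text columns could be dropped like this (an illustrative choice of columns; the screenshot may drop a different set):

```python
# Drop columns not used in this example
df = df.drop(columns=["workclass", "education", "occupation",
                      "relationship", "race", "native-country"])
print(df.head())
```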

Let us drop the income prediction class. Our training dataset then becomes:

Figure 25: Training Dataset
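In code, this amounts to separating the features from the prediction class:

```python
# Features (training dataset) and target (prediction class)
X = df.drop(columns=["income"])
y = df["income"]
```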

Next, we split the dataset into training and testing data and create our models.

Figure 26: Splitting our dataset and creating models
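A sketch of this step, continuing from the snippets above:

```python
# Hold out a test set and instantiate the two classifiers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
log_reg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier()
```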

Let us perform cross-validation, first using K-fold cross-validation. We have taken k as 10. We can see that logistic regression performs better.

Figure 27: K-Fold Cross-Validation
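A sketch of the K-fold comparison, assuming the models defined above:

```python
# K-fold cross-validation with k = 10
kf = KFold(n_splits=10, shuffle=True, random_state=42)
print("Logistic regression:", cross_val_score(log_reg, X, y, cv=kf).mean())
print("KNN:", cross_val_score(knn, X, y, cv=kf).mean())
```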

Now, let’s use Stratified K-Fold and see the results.                 


Figure 28: Stratified K-Fold

Figure 29: Results of Stratified K-Fold
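The same comparison with stratified folds, which keep the income-class ratio constant across folds:

```python
# Stratified K-fold cross-validation with k = 10
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
print("Logistic regression:", cross_val_score(log_reg, X, y, cv=skf).mean())
print("KNN:", cross_val_score(knn, X, y, cv=skf).mean())
```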

Accelerate your career in AI and ML with the AI and ML Course offered by Purdue University in collaboration with IBM.


Conclusion

In this article, The Ultimate Guide to Cross-Validation, we looked at what causes model instability and what cross-validation is. We walked through the steps to perform cross-validation and the cross-validation techniques that are commonly used. Finally, we got hands-on practice implementing cross-validation in Python.

Was this article on cross-validation useful to you? Do you have any doubts or questions for us? Mention them in this article's comments section, and we'll have our experts answer them for you at the earliest.

Looking forward to becoming a Machine Learning Engineer? Check out Simplilearn's Machine Learning Course and get certified today.

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.
