
Stratified K Fold Cross Validation

Last Updated : 15 Apr, 2025

Stratified K-Fold Cross Validation is a technique for evaluating a model. It is particularly useful for classification problems in which the class labels are not evenly distributed, i.e., the data is imbalanced. It is an enhanced version of K-Fold Cross Validation: the key difference is that it uses stratification, which preserves the original distribution of each class across every fold.

For example, if your original dataset had 80% Class 0 and 20% Class 1, each fold would reflect the same class proportions. This yields more reliable accuracy estimates.

Problem with Random Splitting

Random splitting techniques like train_test_split() or regular K-Fold can create problems if they produce imbalanced class proportions in the training and test sets. For example, imagine a binary classification dataset with 100 samples where:

  • 80 samples are Class 0
  • 20 samples are Class 1

With purely random sampling, an 80:20 split could place all 80 Class 0 samples in the training set and all 20 Class 1 samples in the test set. In that case the model would never learn to classify Class 1 and the reported accuracy would be misleading.
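
To make this failure mode concrete, here is a minimal sketch on a synthetic dataset (the variable names and numbers are illustrative, not from the walkthrough below), where plain K-Fold without shuffling yields test folds that contain only one class:

Python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic dataset: 80 samples of Class 0 followed by 20 of Class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # dummy single feature

kf = KFold(n_splits=5)  # no shuffling, so folds follow the sorted label order
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Count how many samples of each class land in the test fold
    print(f"Fold {i}: test class counts = {np.bincount(y[test_idx], minlength=2)}")

The first four test folds contain only Class 0 and the last one only Class 1, so the model is repeatedly evaluated on a class it never saw during training.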

Now, let’s use stratified sampling on the same dataset:

1. Training Set (80 samples):

  • 64 from Class 0 (80% of 80)
  • 16 from Class 1 (80% of 20)

2. Test Set (20 samples):

  • 16 from Class 0 (20% of 80)
  • 4 from Class 1 (20% of 20)

This ensures that both the training and test sets accurately represent the full dataset's class proportions, which in turn gives a more reliable evaluation.
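
Here is a minimal sketch of such a stratified split, again on the synthetic 80/20 dataset from above (train_test_split with stratify=y is scikit-learn's stratified counterpart of a plain random split):

Python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 80/20 dataset as in the example above
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 80:20 class ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

print('Train class counts:', np.bincount(y_train))  # [64 16]
print('Test class counts: ', np.bincount(y_test))   # [16  4]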

In real-world classification tasks, the distribution of observations per class is often imbalanced. In a medical dataset, for example, 90% of patients might be healthy (Class 0) and only 10% have a disease (Class 1). If we split such data randomly, some training/test sets may contain very few samples of the minority class, or even none at all; this is where Stratified K-Fold Cross Validation becomes important.

Implementation of Stratified K-Fold Cross-Validation  

1. Importing Required Libraries

We will be using the statistics module and scikit-learn.

Python
from statistics import mean, stdev
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn import linear_model
from sklearn import datasets

2. Loading Dataset and Extracting Features

Here we will be using the breast cancer dataset available in scikit-learn.

  • x = cancer.data: feature/input values
  • y = cancer.target: output/class labels (0 or 1)
Python
# Load the built-in breast cancer dataset (569 samples, 30 features)
cancer = datasets.load_breast_cancer()

x = cancer.data    # feature/input values
y = cancer.target  # class labels: 0 = malignant, 1 = benign

3. Feature Scaling (Normalization)

  • MinMaxScaler(): scales features to a range between 0 and 1
  • fit_transform(x): fits scaler on data and applies transformation
Python
# Scale every feature to the [0, 1] range
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)
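
As an optional sanity check (not part of the original walkthrough), we can confirm that every feature column now spans the [0, 1] range:

Python
# Per-column minimums should all be 0 and maximums all 1 after scaling
print(x_scaled.min(axis=0))  # array of zeros
print(x_scaled.max(axis=0))  # array of ones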

4. Model and K-Fold Object Setup

Here we will be using logistic regression model.

  • StratifiedKFold(...): sets up 10-fold stratified cross-validation
  • lst_accu_stratified: empty list to store accuracy scores
Python
# Logistic regression classifier
lr = linear_model.LogisticRegression()

# 10 stratified folds; shuffle with a fixed seed for reproducibility
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []  # one accuracy score per fold

5. Applying Stratified K-Fold and Training Model

  • skf.split(x, y): splits dataset into stratified train-test indices
  • x_train_fold, x_test_fold: features for training and testing
  • y_train_fold, y_test_fold: labels for training and testing
Python
# Stratification is based on y, so passing x or x_scaled here is equivalent
for train_index, test_index in skf.split(x_scaled, y):
    x_train_fold, x_test_fold = x_scaled[train_index], x_scaled[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    lr.fit(x_train_fold, y_train_fold)  # train on this fold's training split
    lst_accu_stratified.append(lr.score(x_test_fold, y_test_fold))  # fold accuracy
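
As a quick verification that stratification is actually working (this snippet is an addition, assuming numpy is imported as np), we can print the class counts in each test fold; they should all mirror the full dataset's class balance:

Python
import numpy as np

for i, (train_index, test_index) in enumerate(skf.split(x_scaled, y)):
    # The dataset has 212 malignant / 357 benign samples overall,
    # so each 10-fold test split should hold roughly 21 / 36 of each
    print(f"Fold {i}: test class counts = {np.bincount(y[test_index])}")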

6. Printing Accuracy Results

  • max(): highest accuracy
  • min(): lowest accuracy
  • mean(): average accuracy
Python
print('Accuracy scores for each fold:', lst_accu_stratified)
print('\nMaximum Accuracy:',
      max(lst_accu_stratified) * 100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified) * 100, '%')
print('\nOverall (Mean) Accuracy:',
      mean(lst_accu_stratified) * 100, '%')
print('\nStandard Deviation:', stdev(lst_accu_stratified))

Output: 

[Output screenshot: per-fold accuracy scores, with an overall accuracy of about 96.6% and a standard deviation of about 0.02]

Here we can see that we got an overall accuracy of about 96.6% with a standard deviation of 0.02, which means the model performs consistently across folds.

By using Stratified K-Fold Cross Validation we can ensure that our machine learning model is evaluated fairly and consistently, leading to more reliable accuracy estimates and a better picture of real-world performance.
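
As a closing note, scikit-learn's cross_val_score can run the same stratified evaluation in a single call; this is an equivalent shorthand for the manual loop above, shown here as a sketch:

Python
from sklearn.model_selection import cross_val_score

# cv=skf reuses the StratifiedKFold object defined earlier;
# the default scoring for a classifier is accuracy
scores = cross_val_score(lr, x_scaled, y, cv=skf)
print('Mean accuracy:', scores.mean() * 100, '%')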

