Introduction to UpSampling and DownSampling Imbalanced Data in Python
Imbalanced datasets are a common challenge in machine learning, where one class significantly outweighs another. This imbalance can lead to biased model predictions. Two primary techniques to address this issue are UpSampling and DownSampling:
- UpSampling: Increases the number of samples in the minority class.
- DownSampling: Reduces the majority class size to match the minority class.
In this article, we will explore these techniques, their implementation in Python using libraries like imbalanced-learn, and how to optimize them for better machine learning performance.
Why Are Imbalanced Datasets a Problem?
Imbalanced data occurs when the target class distribution is uneven, with one class having significantly more observations than the other.
For example:
- Fraud detection datasets often have a high number of legitimate transactions (majority class) and very few fraudulent ones (minority class).

This imbalance can bias models toward the majority class, resulting in poor performance on the minority class, which matters most in critical applications like fraud detection or medical diagnosis.
Imbalanced datasets can produce misleading metrics like accuracy. For example, in a dataset where 95% of cases belong to the majority class, a model predicting only the majority class achieves 95% accuracy but fails to identify the minority class.
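A quick way to see this in practice (a small sketch using scikit-learn's DummyClassifier, separate from the tutorial steps below) is to score a model that always predicts the majority class:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic dataset with roughly 95% majority (0) and 5% minority (1)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# A "model" that always predicts the most frequent class
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = clf.predict(X)
print(f"Accuracy: {accuracy_score(y, pred):.2f}")        # ~0.95, looks impressive
print(f"Minority recall: {recall_score(y, pred):.2f}")   # 0.00, it never finds the minority class
High accuracy, zero minority recall: exactly the failure mode described above.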
Key Considerations When Handling Imbalanced Data
- Avoid Data Leakage: Split the dataset into training and testing sets before applying resampling techniques, and resample only the training split (see the sketch after this list). Resampling first would leak duplicated or synthetic training points into the test set and inflate your evaluation metrics.
- Use Synthetic Methods: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) generate synthetic samples for the minority class, improving diversity and model performance.
- Experiment with Both Techniques: Test both upsampling and downsampling to find the optimal approach for your dataset. The choice depends on the dataset size, model requirements, and application.
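To make the first point concrete, here is a minimal sketch of the split-first workflow on a synthetic toy dataset (the names X and y here are illustrative, not the tutorial's dataset):
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split FIRST, then resample only the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train_res, y_train_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

print(f"Train before resampling: {Counter(y_train)}")
print(f"Train after resampling: {Counter(y_train_res)}")
print(f"Test set (left untouched): {Counter(y_test)}")
Because the test set is never resampled, its class distribution still reflects what the model will face in production.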
What is UpSampling?
UpSampling addresses class imbalance in datasets by increasing the number of samples in the minority class. This can be achieved by duplicating existing samples or generating new synthetic samples through methods like SMOTE (Synthetic Minority Over-sampling Technique).
Example: In a fraud detection dataset with 1,000 legitimate transactions (Class 0) and 50 fraudulent transactions (Class 1), upsampling duplicates or synthesizes fraudulent transactions to balance the dataset.
Upsampling improves minority class representation during training, which typically improves the model's recall on the minority class.
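Since SMOTE is mentioned above, here is a minimal sketch of applying it with imbalanced-learn (the dataset is synthetic, roughly mirroring the 1,000-vs-50 fraud example):
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the fraud example: ~95% legitimate vs ~5% fraudulent
X, y = make_classification(n_samples=1050, weights=[0.95, 0.05], random_state=42)
print(f"Before SMOTE: {Counter(y)}")

# SMOTE interpolates between neighboring minority samples rather than duplicating them
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(f"After SMOTE: {Counter(y_res)}")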
What is DownSampling?
DownSampling reduces the number of samples in the majority class to match that of the minority class. This involves randomly selecting a subset of samples from the majority class until its size aligns with that of the minority class.
Example: In a medical diagnosis context where you have 950 healthy patients (Class 0) and only 50 patients with a rare disease (Class 1), downsampling would involve randomly selecting 50 healthy patients to match the number of patients with the disease. This helps ensure that both classes are equally represented during training.
Downsampling reduces bias toward the majority class and helps focus on minority class predictions.
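Before we get to imbalanced-learn's RandomUnderSampler in Step 5, here is a minimal by-hand sketch using sklearn.utils.resample, on a toy frame mirroring the 950-vs-50 medical example:
import pandas as pd
from sklearn.utils import resample

# Toy frame mirroring the example: 950 healthy (0) vs 50 diseased (1) patients
df = pd.DataFrame({'Outcome': [0] * 950 + [1] * 50})
majority = df[df['Outcome'] == 0]
minority = df[df['Outcome'] == 1]

# Randomly keep only as many majority rows as there are minority rows
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
print(balanced['Outcome'].value_counts())  # 50 of each class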
Steps for UpSampling and DownSampling in Python
Step 1: Install and Import the Required Libraries
To perform upsampling and downsampling, first install the imbalanced-learn library (pip install imbalanced-learn), then import the core dependencies:
import pandas as pd  # data loading and manipulation
from sklearn.model_selection import train_test_split  # for splitting before resampling
Step 2: Load the Dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)
data.head()
Output: the first five rows of the dataset, showing the eight feature columns plus the Outcome target.
Step 3: Check the Class Distribution
from collections import Counter
y = data['Outcome']
print(f"Original Class Distribution: {Counter(y)}")
Output:
Original Class Distribution: Counter({0: 500, 1: 268})
Step 4: Perform UpSampling
UpSampling increases the number of samples in the minority class by adding data points. Here, we use RandomOverSampler from the imbalanced-learn library.
from imblearn.over_sampling import RandomOverSampler
X = data.drop('Outcome', axis=1)
y = data['Outcome']
ros = RandomOverSampler(sampling_strategy='minority')  # oversample the minority class to match the majority
X_resampled, y_resampled = ros.fit_resample(X, y)
print(f"After Upsampling: {Counter(y_resampled)}")
Output:
After Upsampling: Counter({1: 500, 0: 500})
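You are not limited to a perfect 1:1 balance. For binary problems, sampling_strategy also accepts a float, interpreted as the desired minority-to-majority ratio after resampling. A short sketch reusing X and y from above:
# A float sampling_strategy requests a minority/majority ratio rather than exact balance
ros_partial = RandomOverSampler(sampling_strategy=0.8, random_state=42)
X_partial, y_partial = ros_partial.fit_resample(X, y)
print(f"Partial upsampling: {Counter(y_partial)}")  # Counter({0: 500, 1: 400})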
Step 5: Downsample the Majority Class
DownSampling reduces the number of samples in the majority class by randomly removing data points. Here, we use RandomUnderSampler from the imbalanced-learn library.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='majority')  # undersample the majority class to match the minority
X_resampled_down, y_resampled_down = rus.fit_resample(X, y)
print(f"After Downsampling: {Counter(y_resampled_down)}")
Output:
After Downsampling: Counter({0: 268, 1: 268})
Step 6: Visualize the Difference
import matplotlib.pyplot as plt

# Plot the class counts before and after each resampling technique side by side
# (wrap the Counter views in list() so matplotlib receives plain sequences)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].bar(list(Counter(y).keys()), list(Counter(y).values()), color='b')
axes[0].set_title('Original Distribution')
axes[1].bar(list(Counter(y_resampled).keys()), list(Counter(y_resampled).values()), color='g')
axes[1].set_title('After Upsampling')
axes[2].bar(list(Counter(y_resampled_down).keys()), list(Counter(y_resampled_down).values()), color='r')
axes[2].set_title('After Downsampling')
plt.tight_layout()
plt.show()
Output: three bar charts comparing the class counts of the original, upsampled, and downsampled datasets.
Key Differences Between Upsampling and Downsampling
| Aspect | Upsampling | Downsampling |
|---|---|---|
| Class Affected | Increases minority class size | Reduces majority class size |
| Data Loss | No data loss (duplicates or synthesizes samples) | Loses information (removes samples) |
| Dataset Size | Increases the dataset size | Reduces the dataset size |
| Best Use Case | Small datasets where every sample is valuable | Large datasets where discarding samples is acceptable |
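When experimenting with both techniques, one convenient pattern (a sketch, not part of the steps above) is imbalanced-learn's pipeline, which applies resampling only while fitting each training fold, so cross-validation scores come from untouched validation folds. This reuses X and y from Step 4 and assumes logistic regression is an acceptable baseline:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE runs only during fit on each training fold;
# validation folds keep their original, imbalanced distribution
pipe = make_pipeline(StandardScaler(), SMOTE(random_state=42), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, scoring='f1', cv=5)
print(f"Mean F1 across folds: {scores.mean():.3f}")
Swapping SMOTE for RandomUnderSampler in the same pipeline is an easy way to compare the two approaches fairly.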
Handling imbalanced datasets with UpSampling and DownSampling improves the reliability of machine learning models, especially for critical tasks like fraud detection or medical diagnosis. By balancing the dataset, you help the model perform well across all classes, minimizing bias toward the majority class and improving minority-class metrics such as recall and F1-score.