Handling outliers with Pandas

Last Updated : 03 Jun, 2025

Outliers are extreme values that lie far away from the majority of the data points in a dataset. They do not follow the general pattern of the data and can arise from data entry mistakes, measurement errors or natural variation. For example, if most people in a group weigh between 50 and 70 kg and one person weighs 150 kg, that person's weight is an outlier because it is far higher than the rest.

Outliers can be classified into two types:

Univariate Outliers: Outliers detected in one feature or variable.
Multivariate Outliers: Outliers detected based on relationships between two or more features.
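
To make the distinction concrete, here is a minimal sketch using made-up height and weight values (the numbers, the column names and the choice of Mahalanobis distance are illustrative assumptions, not part of the examples that follow). The last row looks ordinary in each column on its own, but it breaks the height-weight relationship that the other rows follow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last person is short AND heavy, an unusual combination
people = pd.DataFrame({
    'Height': [150, 155, 160, 165, 170, 175, 180, 185, 152],
    'Weight': [50, 54, 58, 63, 68, 72, 77, 82, 80],
})
features = people[['Height', 'Weight']]

# Univariate view: 152 cm and 80 kg both lie inside the range of their own column
print(features.min().tolist(), features.max().tolist())

# Multivariate view: Mahalanobis distance of each row from the centre of the data
centered = (features - features.mean()).to_numpy()
inv_cov = np.linalg.inv(np.cov(features.to_numpy(), rowvar=False))
people['Distance'] = np.sqrt(np.einsum('ij,jk,ik->i', centered, inv_cov, centered))

# The row with the largest distance is the multivariate outlier candidate
print(people.sort_values('Distance', ascending=False).head(3))
```

The row with the largest distance is the short, heavy person, even though neither value would be flagged when each column is checked separately.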

Outliers need to be handled because they can:

Skew the results of statistical analysis.
Mislead machine learning models.
Reduce model accuracy.
Cause high error rates.

Ignoring outliers can lead to incorrect conclusions and poor model performance in regression and clustering tasks. There are several methods to identify outliers in Python:

Visualization Techniques

Visualization is one of the easiest ways to spot outliers because it gives a clear picture of the data distribution. The examples here use the Pandas, NumPy, Seaborn and Matplotlib libraries.

1. Box Plot

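A box plot shows the first quartile (Q1), median and third quartile (Q3) as a box, with whiskers extending toward the minimum and maximum. By default the whiskers stop at the most extreme values within 1.5 * IQR of the box, and anything beyond them is plotted as an individual point.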

Python

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = {'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]}
df = pd.DataFrame(data)

sns.boxplot(x=df['Age'])
plt.title("Box Plot for Age")
plt.show()

Output:

The plot represents the interquartile range as a box with a line for the median. The data point at 150 is marked separately as an outlier.
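
The numbers behind the plot can also be computed directly. The sketch below assumes the plotting default of whiskers at 1.5 * IQR; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name='Age')

# Quartiles that define the box
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = ages[(ages >= low_fence) & (ages <= high_fence)]
beyond = ages[(ages < low_fence) | (ages > high_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}")
print(f"Whiskers: {inside.min()} to {inside.max()}")
print("Drawn as individual points:", beyond.tolist())
```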

2. Scatter Plot

A scatter plot helps to visualize the relationship between two variables and spot any unusual data points.

Python

sns.scatterplot(x=range(len(df['Age'])), y=df['Age'])
plt.title("Scatter Plot for Age")
plt.show()

Output:

The plot above shows each value on the y-axis against its row index on the x-axis. The point at 150 stands out clearly above the others.

3. Histogram

A histogram displays the frequency distribution of a dataset. Outliers show up as isolated, low-frequency bars separated from the main body of the data by empty bins.

Python

df['Age'].hist(bins=10)
plt.title("Histogram for Age")
plt.show()

Output:

The histogram shows most values grouped between 22 and 30, with the outlier 150 appearing far away from the rest of the data.
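
The gap is easy to confirm without a plot. This sketch counts the values per bin with `np.histogram`, assuming the same ten equal-width bins between the minimum and maximum that `hist(bins=10)` uses.

```python
import numpy as np

ages = [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]

# Ten equal-width bins spanning the minimum to the maximum value
counts, edges = np.histogram(ages, bins=10)

# Print each bin's range and how many values fall into it
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

A run of empty bins between the first and the last bar is the numeric counterpart of the gap seen in the chart.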

Statistical Methods

Outliers can also be detected with statistical methods:

1. Z-Score Method

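The Z-score method calculates how many standard deviations a data point lies from the mean. A common rule treats data points with a Z-score greater than 3 or less than -3 as outliers. This rule suits larger samples: in a sample of n values, no Z-score computed with the population standard deviation can exceed sqrt(n - 1) in absolute value, so with only 10 rows a cutoff of 3 is never crossed.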

Python

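from scipy import stats
import numpy as np
import pandas as pd

data = {'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]}
df = pd.DataFrame(data)

# Absolute Z-score of every value (population standard deviation)
z = np.abs(stats.zscore(df['Age']))
print("Z-Score Values:\n", z)

# With only 10 rows, |z| cannot exceed sqrt(10 - 1) = 3, so a cutoff of 3
# would flag nothing; a lower cutoff is chosen here for this small sample
threshold = 2.5
outliers = df[z > threshold]
print("Outliers:\n", outliers)

Output:

The value 150 has a Z-score of about 2.99, just under 3, so it is flagged only because the cutoff is set to 2.5.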
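
The calculation behind the Z-score can be reproduced with plain pandas. This is a sketch using the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`; it prints the largest score next to the small-sample limit sqrt(n - 1).

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name='Age')

# z = (x - mean) / std, with the population standard deviation (ddof=0)
z = (ages - ages.mean()) / ages.std(ddof=0)

# Largest absolute score in the sample versus the theoretical ceiling
largest = z.abs().max()
limit = np.sqrt(len(ages) - 1)

print("Largest |z|:", round(largest, 3))
print("Limit for n =", len(ages), "is", round(limit, 3))
```

Because the largest score sits just below the limit, the cutoff has to be chosen with the sample size in mind.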

2. Interquartile Range (IQR) Method

Interquartile Range (IQR) method identifies outliers by measuring the spread between the first quartile (Q1) and third quartile (Q3). Any data point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.

Python

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
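print("Outliers:\n", outliers)

Output:

Here Q1 = 23.25 and Q3 = 27.75, so the IQR is 4.5 and the bounds are 16.5 and 34.5. Only the value 150 falls outside them.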
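
The same rule can be wrapped in a small helper so it is easy to reuse on other columns. The function name `iqr_bounds` and its parameter `k` are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

df = pd.DataFrame({'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]})

lower, upper = iqr_bounds(df['Age'])

# Rows whose Age falls outside the fences
flagged = df[~df['Age'].between(lower, upper)]

print("Bounds:", lower, upper)
print(flagged)
```

Raising `k` makes the rule stricter about what counts as an outlier, which can be useful for naturally wide distributions.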

How to Handle Outliers?

Once we have detected outliers we can handle them using different methods:

1. Removing Outliers

If the outliers are due to data entry errors or measurement mistakes, removing them is usually the simplest option. This works well when the outliers carry no information that matters for the analysis. The code below keeps only the rows that lie within the IQR bounds.

Python

df_cleaned = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
print("Cleaned Data:\n", df_cleaned)

Output:

The row containing 150 is removed and the remaining nine rows are kept.
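
Before dropping rows it can help to check how many would be lost. The sketch below is self-contained and recomputes the IQR bounds; using `Series.between` and resetting the index afterwards are stylistic choices, not requirements.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]})

q1, q3 = df['Age'].quantile(0.25), df['Age'].quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True for rows to keep; between() is inclusive on both ends by default
keep = df['Age'].between(lower_bound, upper_bound)
print(f"Dropping {(~keep).sum()} of {len(df)} rows")

# Keep the inliers and renumber the rows from 0
df_cleaned = df[keep].reset_index(drop=True)
print(df_cleaned)
```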

2. Replacing Outliers

In some cases outliers can be replaced with statistical measures like mean or median to reduce their impact without losing data. Median is preferred because it is less affected by extreme values.

Python

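is_outlier = (df['Age'] < lower_bound) | (df['Age'] > upper_bound)
df_replaced = df.copy()  # keep the original df intact for the next method
df_replaced['Age'] = np.where(is_outlier, df['Age'].median(), df['Age'])
print("Data after Replacing Outliers:\n", df_replaced)

Output:

The value 150 is replaced by the median, 25.5, and all other values are unchanged.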
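
The preference for the median is easy to check on the example data. This short sketch compares the mean and the median with and without the value 150.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name='Age')
without_outlier = ages[ages != 150]

# The mean is pulled strongly toward the extreme value
print("Mean   with / without 150:", ages.mean(), round(without_outlier.mean(), 2))

# The median barely moves
print("Median with / without 150:", ages.median(), without_outlier.median())
```

The single extreme value shifts the mean by more than 12 years but the median by only half a year.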

3. Capping Outliers

Capping limits extreme values to predefined lower and upper bounds, which ensures that no value falls outside these limits.

Python

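# Start from the original values so the cap is applied to the raw data
df_capped = pd.DataFrame(data)
df_capped['Age'] = df_capped['Age'].clip(lower=lower_bound, upper=upper_bound)
print("Data after Capping Outliers:\n", df_capped)

Output:

The value 150 is capped at the upper bound, 34.5, and values inside the bounds are unchanged.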
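
The bounds do not have to come from the IQR. As a sketch, they can also be fixed limits taken from domain knowledge; the values 18 and 65 below are hypothetical and chosen only for illustration.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name='Age')

# Hypothetical domain limits, e.g. a study restricted to working-age adults
MIN_AGE, MAX_AGE = 18, 65

capped = ages.clip(lower=MIN_AGE, upper=MAX_AGE)

# Positions whose value was changed by the cap
changed = ages[ages != capped]

print(capped.tolist())
print("Changed rows:", changed.index.tolist())
```

Keeping track of which rows were changed makes it possible to review the capped values later.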

Handling outliers effectively is important for accurate data analysis and building reliable machine learning models.

ayushimalm50