
Handling outliers with Pandas

Last Updated : 03 Jun, 2025

Outliers are extreme values that lie far away from the majority of the data points in a dataset. They do not follow the general pattern of the data and can arise from data entry mistakes, measurement errors or genuine natural variation. For example, if most people in a group weigh between 50 and 70 kg and one person weighs 150 kg, that person's weight is considered an outlier because it is far higher than the others.

Outliers can be classified into two types:

  • Univariate Outliers: Outliers detected in one feature or variable.
  • Multivariate Outliers: Outliers detected based on relationships between two or more features.

Outliers should be handled because they can:

  • Skew the results of statistical analysis.
  • Mislead machine learning models.
  • Reduce model accuracy.
  • Cause high error rates.
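
As a rough illustration of the first point, the sketch below compares summary statistics with and without a single extreme value. It reuses the ten Age values from the examples in this article; the variable names are illustrative only.

```python
import pandas as pd

# Ten ages, one of which (150) is clearly implausible
ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])
ages_without = ages[ages != 150]

summary = pd.DataFrame({
    "with_outlier": [ages.mean(), ages.median(), ages.std()],
    "without_outlier": [ages_without.mean(), ages_without.median(), ages_without.std()],
}, index=["mean", "median", "std"])

print(summary.round(2))
```

The single value 150 pulls the mean from about 25.1 up to 37.6, while the median only moves from 25.0 to 25.5.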

Ignoring outliers can lead to incorrect conclusions and poor model performance in regression and clustering tasks. There are several methods to identify outliers in Python:

Visualization Techniques

Visualization is one of the easiest ways to spot outliers because it gives a clear picture of the data distribution. The examples below use the Pandas, NumPy, Seaborn and Matplotlib libraries.

1. Box Plot

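A box plot draws a box from the first quartile (Q1) to the third quartile (Q3) with a line at the median. The whiskers extend to the smallest and largest values that still lie within 1.5 * IQR of the box, and any values beyond the whiskers are plotted as individual points, which makes outliers easy to spot.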

Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = {'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]}
df = pd.DataFrame(data)

sns.boxplot(x=df['Age'])
plt.title("Box Plot for Age")
plt.show()

Output:

The box represents the interquartile range, with a line marking the median. The data point at 150 is plotted separately as an outlier.
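
The values a box plot draws can also be computed directly. The following is a minimal sketch, assuming the common 1.5 * IQR whisker rule (the default in Matplotlib and Seaborn); names such as `whisker_low` and `fliers` are illustrative and not part of any library.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])

q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme data points still inside the 1.5 * IQR fences
inside = ages[(ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point
fliers = ages[~ages.index.isin(inside.index)]

print(f"Q1={q1}, median={median}, Q3={q3}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("fliers:", fliers.tolist())
```

For this sample the whiskers run from 22 to 29 and the only point drawn separately is 150.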

2. Scatter Plot

A scatter plot helps to visualize the relationship between two variables and spot any unusual data points.

Python
sns.scatterplot(x=range(len(df['Age'])), y=df['Age'])
plt.title("Scatter Plot for Age")
plt.show()

Output:

The plot above shows each data point against its row index, with Age on the y-axis. The point at 150 stands out clearly above the others.

3. Histogram

A histogram displays the frequency distribution of a dataset. Outliers show up as isolated bars that are separated from the bulk of the data by empty bins.

Python
df['Age'].hist(bins=10)
plt.title("Histogram for Age")
plt.show()

Output:

The histogram shows most values grouped around 22-30, with the outlier 150 appearing far away from the rest of the data.
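
The same picture can be read off numerically. This sketch uses `np.histogram` with ten equal-width bins over the full data range, which is the binning a 10-bin histogram of this column uses; the loop variable names are illustrative.

```python
import numpy as np

ages = np.array([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])

# Ten equal-width bins between the minimum (22) and maximum (150)
counts, edges = np.histogram(ages, bins=10)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

Nine values fall into the first bin, eight bins are empty, and the last bin holds the single value 150.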

Statistical Methods

Statistical methods can also be used to detect outliers:

1. Z-Score Method

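The Z-score method calculates how many standard deviations a data point lies from the mean. Data points with an absolute Z-score above a chosen threshold, commonly 3, are considered outliers. In very small samples this threshold can be unreachable, because a single extreme value also inflates the standard deviation: with n points the largest possible absolute Z-score (using the population standard deviation, as scipy.stats.zscore does by default) is sqrt(n - 1), which is exactly 3 for ten points. A lower threshold such as 2.5 is therefore more practical for a sample this small.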

Python
from scipy import stats
import numpy as np
import pandas as pd

data = {'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]}
df = pd.DataFrame(data)

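# stats.zscore uses the population standard deviation (ddof=0) by default
z = np.abs(stats.zscore(df['Age']))
print("Z-Score Values:\n", z)

# With only 10 points |z| cannot exceed sqrt(10 - 1) = 3, and 150 scores about 2.99,
# so a cut-off of 3 would flag nothing here; 2.5 is used instead
threshold = 2.5
outliers = df[z > threshold]
print("Outliers:\n", outliers)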

Output:

Figure: Outliers identified using the Z-Score method
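
One caveat is worth checking on small samples: an extreme value inflates the standard deviation it is measured against, so its Z-score can stay below 3. The sketch below recomputes the Z-scores with pandas alone (population standard deviation, `ddof=0`, which is the default of `scipy.stats.zscore`) and compares two thresholds. The 2.5 cut-off is an illustrative choice, not a fixed rule.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])

# Population standard deviation (ddof=0), matching scipy.stats.zscore's default
z = (ages - ages.mean()) / ages.std(ddof=0)

# For n points, |z| can never exceed sqrt(n - 1)
max_possible = np.sqrt(len(ages) - 1)
print(f"largest |z| in the data: {z.abs().max():.3f}")
print(f"largest |z| possible for n={len(ages)}: {max_possible:.3f}")

print("flagged at |z| > 3:  ", ages[z.abs() > 3].tolist())
print("flagged at |z| > 2.5:", ages[z.abs() > 2.5].tolist())
```

Here the value 150 has a Z-score of roughly 2.99, so a threshold of 3 flags nothing while 2.5 flags it.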

2. Interquartile Range (IQR) Method

The Interquartile Range (IQR) method identifies outliers using the spread of the middle half of the data, IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles. Any data point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.

Python
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
# Print the result explicitly; a bare `outliers` only displays in a notebook
print("Outliers:\n", outliers)

Output:

Figure: Outliers identified using the IQR method
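
When the same rule is needed for several columns, it can be wrapped in a small helper. This is a minimal sketch; `iqr_bounds` and its `k` parameter are hypothetical names introduced here, not part of pandas.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name="Age")
lower, upper = iqr_bounds(ages)

# between() is inclusive on both ends, so negating it selects values outside the fences
is_outlier = ~ages.between(lower, upper)

print(f"fences: {lower} to {upper}")
print(ages[is_outlier])
```

For the Age sample the fences are 16.5 and 34.5, so only 150 is flagged. Raising `k` (for example to 3) makes the rule stricter about what counts as an outlier.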

How to Handle Outliers?

Once we have detected outliers we can handle them using different methods:

1. Removing Outliers

If the outliers are due to data entry errors or measurement mistakes, removing them is usually the simplest option. This works well when the outlying rows are not important for the analysis. The code below keeps only the rows that fall within the IQR bounds.

Python
df_cleaned = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
print("Cleaned Data:\n", df_cleaned)

Output:

Figure: Cleaned data after removing outliers

2. Replacing Outliers

In some cases outliers can be replaced with a statistical measure such as the mean or median, which reduces their impact without dropping rows. The median is usually preferred because it is less affected by extreme values.

Python
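# Replace on a copy so the original df stays intact for the next example
df_replaced = df.copy()
df_replaced['Age'] = np.where((df['Age'] < lower_bound) | (df['Age'] > upper_bound), df['Age'].median(), df['Age'])
print("Data after Replacing Outliers:\n", df_replaced)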

Output:

Figure: Data after replacing outliers
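
A variation on this idea is to take the median of the non-outlying values only, so that the replacement is not influenced by the values being replaced. The sketch below does this with `Series.mask`; the names `is_outlier`, `inlier_median` and `replaced` are illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name="Age")

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Median of the remaining values only, so the outlier cannot influence it
inlier_median = ages[~is_outlier].median()

# Series.mask replaces entries where the condition is True and returns a new Series
replaced = ages.mask(is_outlier, inlier_median)
print(replaced.tolist())
```

On this sample the difference is small (25.0 instead of 25.5), but it grows when a larger share of the values are outliers.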

3. Capping Outliers

Capping limits extreme values to predefined lower and upper bounds, ensuring that no value falls outside these limits.

Python
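# Cap on a copy; values below lower_bound or above upper_bound are set to the bound
df_capped = df.copy()
df_capped['Age'] = df_capped['Age'].clip(lower_bound, upper_bound)
print("Data after Capping Outliers:\n", df_capped)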

Output:

Figure: Data after capping outliers
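
To compare the three strategies side by side, the sketch below applies each one to the original Age data without modifying it in place. It assumes the IQR bounds from earlier; the `comparison` table and its column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]})

q1, q3 = df["Age"].quantile(0.25), df["Age"].quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["Age"] < lower_bound) | (df["Age"] > upper_bound)

# Each strategy produces a new Series; df itself is never modified
removed = df.loc[~is_outlier, "Age"]
replaced = df["Age"].where(~is_outlier, df["Age"].median())
capped = df["Age"].clip(lower_bound, upper_bound)

comparison = pd.DataFrame({
    "rows": [len(df), len(removed), len(replaced), len(capped)],
    "max": [df["Age"].max(), removed.max(), replaced.max(), capped.max()],
    "mean": [df["Age"].mean(), removed.mean(), replaced.mean(), capped.mean()],
}, index=["original", "removed", "replaced", "capped"])

print(comparison.round(2))
```

Removal shrinks the dataset to nine rows, while replacement and capping keep all ten. Capping leaves the largest value at the upper bound of 34.5, so it retains a hint that the row was extreme.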

Handling outliers effectively is important for accurate data analysis and building reliable machine learning models.

