Handling outliers with Pandas
Outliers are extreme values that lie far away from the majority of the data points in a dataset. These values do not follow the general pattern of the data and can occur for various reasons, such as data entry mistakes, measurement errors or natural variation. For example, if most people in a group weigh between 50 and 70 kg and one person weighs 150 kg, that person's weight is an outlier because it is far higher than the others.
Outliers can be classified into two types:
- Univariate Outliers: Outliers detected in one feature or variable.
- Multivariate Outliers: Outliers detected based on relationships between two or more features.
Outliers should be handled because they can:
- Skew the results of statistical analysis.
- Mislead machine learning models.
- Reduce model accuracy.
- Cause high error rates.
Ignoring outliers can lead to incorrect conclusions and poor model performance in regression and clustering tasks. There are several methods to identify outliers in Python:
Visualization Techniques
Visualization is one of the easiest ways to spot outliers because it gives a clear picture of the data distribution. The examples here use the Pandas, NumPy, Seaborn and Matplotlib libraries.
1. Box Plot
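A box plot shows the first quartile (Q1), median and third quartile (Q3) as a box, with whiskers extending toward the minimum and maximum values. Points that lie beyond the whiskers (by default, more than 1.5 times the interquartile range away from the box) are plotted individually as outliers.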
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = {'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]}
df = pd.DataFrame(data)
sns.boxplot(x=df['Age'])
plt.title("Box Plot for Age")
plt.show()
Output:
The plot shows the interquartile range as a box with a line at the median. The data point at 150 is drawn separately as an outlier.
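The values a box plot draws can also be read off numerically. The following is a minimal sketch that assumes Matplotlib's `cbook.boxplot_stats` helper with a whisker factor of 1.5; the variable names `ages` and `box` are illustrative and not part of the original example.

```python
import numpy as np
from matplotlib import cbook

ages = np.array([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])

# boxplot_stats returns one dict per dataset with the values a box plot draws
box = cbook.boxplot_stats(ages, whis=1.5)[0]

print("Q1:", box["q1"], "Median:", box["med"], "Q3:", box["q3"])
print("Whiskers:", box["whislo"], "to", box["whishi"])
print("Points drawn as outliers:", box["fliers"])
```

The whiskers stop at the most extreme values that are still within 1.5 times the IQR of the box, so 150 is reported under `fliers` rather than as the maximum.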
2. Scatter Plot
A scatter plot helps to visualize the relationship between two variables and spot any unusual data points.
sns.scatterplot(x=range(len(df['Age'])), y=df['Age'])
plt.title("Scatter Plot for Age")
plt.show()
Output:
The plot above shows each value on the y-axis against its row index on the x-axis. The point at 150 stands out clearly above the others.
3. Histogram
A histogram displays the frequency distribution of a dataset, which makes outliers easy to identify as isolated bars far away from where most values are concentrated.
df['Age'].hist(bins=10)
plt.title("Histogram for Age")
plt.show()
Output:
The histogram shows most values grouped around 22-30, with the outlier 150 appearing far away from the rest of the data.
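To see the same picture as numbers, the bin counts can be computed directly. This sketch assumes NumPy's `np.histogram` with ten equal-width bins between the minimum and maximum, which is the binning a 10-bin histogram uses by default; the names `ages`, `counts` and `edges` are illustrative.

```python
import numpy as np

ages = np.array([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])

# Ten equal-width bins spanning the range from the smallest to the largest value
counts, edges = np.histogram(ages, bins=10)

# Print each bin's range next to the number of values that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

Nine values land in the first bin, the bins in between are empty, and the last bin holds the single value 150. That gap is what makes the outlier visible in the plot.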
Statistical Methods
Outliers can also be detected with statistical methods:
1. Z-Score Method
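The Z-Score method calculates how many standard deviations a data point is from the mean. Data points with a Z-score greater than 3 or less than -3 are commonly considered outliers. In very small samples this cutoff can be unreachable: with only 10 values no Z-score can exceed 3 in absolute value, so the example below uses a lower cutoff of 2.5.
from scipy import stats
import numpy as np
import pandas as pd
data = {'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]}
df = pd.DataFrame(data)
# Absolute Z-score of every value (scipy uses the population standard deviation by default)
z = np.abs(stats.zscore(df['Age']))
print("Z-Score Values:\n", z)
# A cutoff of 3 would flag nothing here because 150 scores about 2.99
threshold = 2.5
outliers = df[z > threshold]
print("Outliers:\n", outliers)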
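The Z-score can also be computed by hand, which makes the role of the standard deviation explicit. The sketch below is an illustration using only Pandas and NumPy; the names `z_population` and `z_sample` are assumptions made for this example. It compares the population standard deviation (`ddof=0`, the default in `scipy.stats.zscore`) with the sample standard deviation (`ddof=1`, the default in `Series.std`).

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])

# Z-score by hand: distance from the mean in units of standard deviation.
# ddof=0 is the population standard deviation, ddof=1 the sample version,
# which is slightly larger and therefore gives slightly smaller scores.
z_population = (ages - ages.mean()) / ages.std(ddof=0)
z_sample = (ages - ages.mean()) / ages.std(ddof=1)

print("Largest |z| with ddof=0:", round(z_population.abs().max(), 3))
print("Largest |z| with ddof=1:", round(z_sample.abs().max(), 3))

# With ddof=0, no |z| can exceed sqrt(n - 1) for a sample of size n
print("Upper limit for n = 10:", np.sqrt(len(ages) - 1))
```

For this sample the value 150 scores just under 3 with `ddof=0`, so a strict cutoff of 3 flags nothing. With small datasets it is worth checking the largest score before settling on a threshold.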
2. Interquartile Range (IQR) Method
The Interquartile Range (IQR) method identifies outliers by measuring the spread between the first quartile (Q1) and third quartile (Q3). Any data point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
print("Outliers:\n", outliers)
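The bound calculation can be wrapped in a small reusable function. This is a sketch under the same 1.5 * IQR rule; the function name `iqr_bounds` and its parameter `k` are hypothetical helpers introduced for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds placed k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name="Age")
lower, upper = iqr_bounds(ages)

print("Bounds:", lower, upper)
print("Outliers:", ages[(ages < lower) | (ages > upper)].tolist())
```

For this data Q1 is 23.25 and Q3 is 27.75, so the bounds are 16.5 and 34.5 and only the value 150 falls outside them.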
How to Handle Outliers?
Once we have detected outliers we can handle them using different methods:
1. Removing Outliers
If the outliers are due to data entry errors or measurement mistakes, removing them is usually the simplest option. This works well when the outliers carry no information that matters for the analysis. The rows outside the IQR bounds can be dropped as follows:
df_cleaned = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
print("Cleaned Data:\n", df_cleaned)
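An equivalent filter can be written with `Series.between`, which some readers find easier to scan than two comparisons. This sketch recomputes the IQR bounds so that it runs on its own; resetting the index afterwards is an optional choice made here, not something the original example does.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 28, 24, 23, 29, 150, 27, 26, 22]})

# Recompute the IQR bounds so the snippet is self-contained
q1, q3 = df['Age'].quantile(0.25), df['Age'].quantile(0.75)
lower_bound = q1 - 1.5 * (q3 - q1)
upper_bound = q3 + 1.5 * (q3 - q1)

# between() is inclusive on both ends by default, matching >= and <=
mask = df['Age'].between(lower_bound, upper_bound)
df_cleaned = df[mask].reset_index(drop=True)

print("Rows before:", len(df), "Rows after:", len(df_cleaned))
```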
2. Replacing Outliers
In some cases outliers can be replaced with a statistical measure such as the mean or median, which reduces their impact without dropping any rows. The median is preferred because it is less affected by extreme values.
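df_replaced = df.copy()  # keep the original df intact for the next method
is_outlier = (df['Age'] < lower_bound) | (df['Age'] > upper_bound)
df_replaced['Age'] = np.where(is_outlier, df['Age'].median(), df['Age'])
print("Data after Replacing Outliers:\n", df_replaced)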
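The preference for the median is easy to check on this data. The sketch below compares the mean and median with and without the value 150; the name `without_outlier` is illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22])
without_outlier = ages[ages != 150]

# The mean shifts by more than 12 years when 150 is included,
# while the median moves by only half a year
print("Mean with / without 150:  ", ages.mean(), "/", round(without_outlier.mean(), 2))
print("Median with / without 150:", ages.median(), "/", without_outlier.median())
```

Because the median barely changes, using it as the replacement value gives nearly the same result whether or not the outlier is included in the calculation.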
3. Capping Outliers
Capping limits extreme values to predefined lower and upper bounds, ensuring that no value falls outside them.
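# Start from the original values so the effect of capping is visible
df_capped = pd.DataFrame(data)
df_capped['Age'] = np.clip(df_capped['Age'], lower_bound, upper_bound)
print("Data after Capping Outliers:\n", df_capped)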
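Pandas offers the same operation as a method, `Series.clip`. The following sketch recomputes the IQR bounds so that it is self-contained and shows that capping, unlike removal, keeps every row; the name `capped` is illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 28, 24, 23, 29, 150, 27, 26, 22], name='Age')

# IQR bounds, computed the same way as in the detection step
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
lower_bound = q1 - 1.5 * (q3 - q1)
upper_bound = q3 + 1.5 * (q3 - q1)

# Values outside the bounds are pulled back to the nearest bound
capped = ages.clip(lower=lower_bound, upper=upper_bound)

print("Rows kept:", len(capped), "of", len(ages))
print("Largest value after capping:", capped.max())
```

The value 150 becomes 34.5, the upper bound, while all other values are unchanged.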
Handling outliers effectively is important for accurate data analysis and building reliable machine learning models.