Z score for Outlier Detection - Python

Last Updated : 03 Apr, 2025

The Z-score (or standard score) is an important concept in statistics. It shows whether a data value is greater or smaller than the mean and how far from the mean it lies. More specifically, the Z-score tells how many standard deviations a data point is away from the mean.

Z-score = (x - mean) / standard deviation

An outlier is a data point that lies significantly outside the general pattern of the data. Outliers can occur for several reasons, such as errors in data collection, measurement inconsistencies, or rare events. Detecting outliers is essential because they can skew statistical analyses and lead to inaccurate results.

To detect outliers with Z-scores, first note how a Z-score is read:

A Z-score of 0 indicates that the data point is exactly at the mean.
A positive Z-score indicates that the data point is above the mean.
A negative Z-score indicates that the data point is below the mean.
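As a quick illustration of the three cases above, the following minimal sketch applies the formula by hand to a small made-up sample. The values 2, 4 and 6 are an assumption chosen for illustration, not data from this article, and the standard deviation used is the population one (NumPy's default).

```python
import numpy as np

# Hypothetical sample chosen only for illustration
sample = np.array([2.0, 4.0, 6.0])

mean = sample.mean()
std = sample.std()  # population standard deviation (ddof=0)

# Z-score = (x - mean) / standard deviation
z_scores = (sample - mean) / std

for x, z in zip(sample, z_scores):
    position = "at" if z == 0 else ("above" if z > 0 else "below")
    print(f"value={x}: z={z:+.3f} ({position} the mean)")
```

The middle value equals the mean and gets a Z-score of 0, while the smaller and larger values get a negative and a positive Z-score of equal size.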

Commonly, data points with a Z-score greater than 3 or less than -3 are considered outliers, as they lie more than 3 standard deviations away from the mean. This threshold can be adjusted based on the dataset and the specific needs of the analysis.
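To show how the choice of threshold changes the result, here is a small sketch. The helper name flag_outliers and the readings list are hypothetical, introduced only for this example, and the Z-scores use the population standard deviation.

```python
import numpy as np

def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()  # population std (ddof=0)
    return arr[np.abs(z) > threshold].tolist()

# Hypothetical measurements, made up for this example
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 25]

strict = flag_outliers(readings, threshold=3.0)
relaxed = flag_outliers(readings, threshold=2.5)
print(strict)   # nothing is flagged at the common cut-off of 3
print(relaxed)  # 25 is flagged once the cut-off is lowered to 2.5
```

Here the value 25 has a Z-score of roughly 2.93, so it is missed at a threshold of 3 but caught at 2.5. In small samples a single extreme value inflates the standard deviation, which is one reason the threshold may need adjusting.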

For normally distributed data, about 68% of the data points lie within +/- 1 standard deviation of the mean, about 95% lie within +/- 2 standard deviations, and about 99.7% lie within +/- 3 standard deviations.
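These percentages can be checked numerically. The sketch below assumes a perfectly normal distribution and uses the standard identity that the fraction within k standard deviations equals erf(k / sqrt(2)). The function name fraction_within is a hypothetical helper.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: fraction_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within +/- {k} std: {p:.1%}")
```

The printed values round to the familiar 68%, 95% and 99.7% figures. Real datasets are rarely exactly normal, so these fractions are a guide rather than a guarantee.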

If the absolute value of a data point's Z-score is more than 3, the data point is quite different from the other data points and may be an outlier. For example, suppose a survey asked how many children each person had, and the responses were:

1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2

Clearly, 15 is an outlier in this dataset.
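The Z-score formula backs up this judgment. The following sketch computes the Z-score of 15 by hand using only Python's standard library. It assumes the population standard deviation (statistics.pstdev), and the variable names are illustrative.

```python
import statistics

children = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]

mean = statistics.fmean(children)   # about 2.67
std = statistics.pstdev(children)   # population standard deviation, about 3.36

# Z-score = (x - mean) / standard deviation
z_15 = (15 - mean) / std
print(f"mean={mean:.2f}, std={std:.2f}, z(15)={z_15:.2f}")
```

The Z-score of 15 comes out at about 3.67, which is above the usual threshold of 3.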

Calculating Z Score using Python for Outlier Detection

Let’s demonstrate how to use the Z-score to detect outliers in Python:

Step 1: Import Necessary Libraries

We will import NumPy, pandas, SciPy and Matplotlib to calculate the Z-scores and visualize the outlier.

Python

import numpy as np
import pandas as pd
from scipy.stats import zscore
import matplotlib.pyplot as plt

Step 2: Create the Dataset

We use the survey dataset from above (1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2) and convert it into a pandas DataFrame.

Python

data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
df = pd.DataFrame(data, columns=['Value'])

Step 3: Calculate Z-Scores

Now, we calculate the Z-scores for this dataset using the zscore function from scipy.stats.

Python

df['Z-score'] = zscore(df['Value'])
print(df)

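Output: the DataFrame now has a Z-score column. The value 15 (index 7) has a Z-score of about 3.67, while every other value has a Z-score between about -0.50 and 0.10.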
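One detail worth keeping in mind: scipy.stats.zscore divides by the population standard deviation by default (ddof=0), whereas the pandas Series.std() method defaults to the sample standard deviation (ddof=1). A Z-score computed by hand with pandas can therefore differ slightly from the zscore result. The sketch below compares the two conventions on the same data; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import zscore

data = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])

z_population = zscore(data)       # default: ddof=0 (population std)
z_sample = zscore(data, ddof=1)   # sample std, matching pandas' Series.std()

idx = int(np.argmax(data))        # position of the value 15
print(f"ddof=0: {z_population[idx]:.3f}")
print(f"ddof=1: {z_sample[idx]:.3f}")
```

For this dataset the value 15 scores about 3.67 with ddof=0 and about 3.55 with ddof=1, so it is flagged either way. On borderline values the choice can decide whether a point crosses the threshold.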

Step 4: Identify Outliers

Next, we'll identify the data points that have a Z-score greater than 3 or less than -3, which are commonly considered outliers.

Python

outliers = df[df['Z-score'].abs() > 3]
print(outliers)

Output: a single row, index 7, with Value 15 and a Z-score of about 3.67.

Step 5: Visualize the Data

To better understand the outliers, let’s create a scatter plot to visualize the dataset and highlight the outliers.

Python

plt.figure(figsize=(8,6))
# Draw every value on a single horizontal line (y = 0)
plt.scatter(df['Value'], np.zeros_like(df['Value']), color='blue', label='Data points')
# Overplot the flagged outliers in red so they stand out
plt.scatter(outliers['Value'], np.zeros_like(outliers['Value']), color='red', label='Outliers')
plt.title('Outlier Detection using Z-Score')
plt.xlabel('Value')
plt.legend()
plt.show()

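Output: a scatter plot with all values on a horizontal line at y = 0, where 15 appears in red far to the right of the blue points clustered between 1 and 3.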

In this case, the value 15 is an outlier because its Z-score (about 3.67) exceeds the threshold of 3, while every other value in the dataset has an absolute Z-score below 0.5.

By applying the Z-score method, you can quickly identify and deal with outliers, improving the accuracy of your data analysis and statistical models.
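One possible way to deal with the flagged points is to filter them out before further analysis. The sketch below is a minimal example under that assumption. Dropping rows is only one option, and whether it is appropriate depends on why the outlier occurred. The names threshold and cleaned are illustrative.

```python
import pandas as pd
from scipy.stats import zscore

data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
df = pd.DataFrame(data, columns=['Value'])
df['Z-score'] = zscore(df['Value'])

# Keep only rows whose absolute Z-score is within the threshold
threshold = 3  # assumed cut-off; adjust for your analysis
cleaned = df[df['Z-score'].abs() <= threshold].reset_index(drop=True)

print(len(df), len(cleaned))
print(df['Value'].mean(), cleaned['Value'].mean())
```

After filtering, 14 of the 15 rows remain and the mean drops from about 2.67 to about 1.79, which shows how strongly the single value 15 influenced the original mean.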

ektamaini
