Z score for Outlier Detection - Python
Z score (or standard score) is an important concept in statistics. It helps to understand if a data value is greater or smaller than the mean and how far away it is from the mean. More specifically, the Z score tells how many standard deviations away a data point is from the mean.
Z score = (x - mean) / standard deviation
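As a minimal sketch of the formula above, the snippet below computes Z-scores by hand with NumPy. The function name z_scores and the sample values are illustrative assumptions, and the population standard deviation (ddof=0) is assumed.

```python
import numpy as np

def z_scores(values):
    """Return (x - mean) / std for each value, using the population std (ddof=0)."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()

# Hypothetical sample chosen so that the mean is 5 and the standard deviation is 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(sample))
```

For this sample, the value 9 is 4 above the mean of 5, so its Z-score is 4 / 2 = 2, while the value 2 has a Z-score of -1.5.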
An outlier is a data point that lies significantly outside the general pattern of the data. Outliers can occur for several reasons, such as errors in data collection, measurement inconsistencies, or rare events. Detecting outliers is essential because they can skew statistical analyses and lead to inaccurate results.
Z-scores can be used to detect outliers. A Z-score is read as follows:
- A Z-score of 0 indicates that the data point is exactly at the mean.
- A positive Z-score indicates that the data point is above the mean.
- A negative Z-score indicates that the data point is below the mean.
Commonly, data points with a Z-score greater than 3 or less than -3 are considered outliers, as they lie more than 3 standard deviations away from the mean. This threshold can be adjusted based on the dataset and the specific needs of the analysis.
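The thresholding rule described above can be sketched as a small helper with an adjustable cutoff. The function name find_outliers and the readings are hypothetical, and the sketch assumes the values are not all identical (otherwise the standard deviation is zero).

```python
import numpy as np

def find_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()  # population std (ddof=0)
    return arr[np.abs(z) > threshold]

# Hypothetical readings: nineteen values between 10 and 13, plus one extreme value
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 12, 10, 13, 50]

print(find_outliers(readings))                 # default cutoff of 3
print(find_outliers(readings, threshold=2.0))  # a lower cutoff flags at least as many points
```

Taking the absolute value covers both the "greater than 3" and "less than -3" cases in one comparison.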
A normal distribution is shown below. In a normal distribution, approximately 68% of the data points lie within 1 standard deviation of the mean, 95% lie within 2 standard deviations, and 99.7% lie within 3 standard deviations.
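These percentages can be checked numerically. The sketch below uses the cumulative distribution function of the standard normal distribution from scipy.stats; the variable name coverage is illustrative.

```python
from scipy.stats import norm

# Share of a normal distribution lying within k standard deviations of the mean
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within +/- {k} standard deviation(s): {share:.4f}")
```

The three shares come out at roughly 0.6827, 0.9545, and 0.9973, which is where the 68%, 95%, and 99.7% figures come from.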

If the Z-score of a data point is greater than 3 or less than -3, the data point is quite different from the other data points and may be an outlier. For example, suppose a survey asked people how many children they had, and the responses were:
1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
Clearly, 15 is an outlier in this dataset.
Calculating Z Score using Python for Outlier Detection
Let’s demonstrate how to use the Z-score to detect outliers in Python:
Step 1: Import Necessary Libraries
We will import numpy, pandas, scipy and matplotlib to calculate the z-score and visualize the outlier.
import numpy as np
import pandas as pd
from scipy.stats import zscore
import matplotlib.pyplot as plt
Step 2: Create the Dataset
Store the survey data (1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2) in a list and convert it into a pandas DataFrame.
data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
df = pd.DataFrame(data, columns=['Value'])
Step 3: Calculate Z-Scores
Now, we calculate the Z-scores for this dataset using the zscore function from scipy.stats.
df['Z-score'] = zscore(df['Value'])
print(df)
Output:
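As a cross-check on Step 3, the sketch below recomputes the Z-scores by hand. It relies on the default behavior of scipy.stats.zscore, which uses the population standard deviation (ddof=0), whereas the pandas Series.std() method defaults to the sample standard deviation (ddof=1). The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

values = pd.Series([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])

z_scipy = zscore(values.to_numpy())                            # ddof=0 by default
z_population = (values - values.mean()) / values.std(ddof=0)   # manual formula, ddof=0
z_sample = (values - values.mean()) / values.std()             # pandas default, ddof=1

# The manual ddof=0 formula should agree with SciPy
print(np.allclose(z_scipy, z_population))

# Z-score of the value 15 (index 7) under each convention
print(round(z_population[7], 2), round(z_sample[7], 2))
```

The value 15 scores about 3.67 with the population standard deviation and about 3.55 with the sample standard deviation, so it exceeds the threshold of 3 under either convention.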
Step 4: Identify Outliers
Next, we'll identify the data points that have a Z-score greater than 3 or less than -3, which are commonly considered outliers.
outliers = df[df['Z-score'].abs() > 3]
print(outliers)
Output:
Step 5: Visualize the Data
To better understand the outliers, let’s create a scatter plot to visualize the dataset and highlight the outliers.
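plt.figure(figsize=(8, 6))
# Plot every value along the x-axis; the y position is fixed at 0 and carries no meaning
plt.scatter(df['Value'], np.zeros_like(df['Value']), color='blue', label='Data points')
# Draw the outliers on top in red so they stand out
plt.scatter(outliers['Value'], np.zeros_like(outliers['Value']), color='red', label='Outliers')
plt.title('Outlier Detection using Z-Score')
plt.xlabel('Value')
plt.yticks([])  # hide the y-axis ticks, since only the x position matters
plt.legend()
plt.show()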
Output:

In this case, the value 15 is an outlier because its Z-score (about 3.67) exceeds the threshold of 3, while every other value in the dataset has a Z-score close to 0.
By applying the Z-score method, you can quickly identify and deal with outliers, which can improve the accuracy of your data analysis and statistical models.
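One possible way to deal with a flagged value is to drop it, sketched below on the same survey data. This is only one option, not a recommendation: whether removal is appropriate depends on whether the value is an error or a genuine rare event. The names threshold and cleaned are illustrative.

```python
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({'Value': [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]})
df['Z-score'] = zscore(df['Value'])

# Keep only the rows whose absolute Z-score is within the threshold
threshold = 3
cleaned = df[df['Z-score'].abs() <= threshold]

print(df['Value'].mean())       # mean with the outlier included
print(cleaned['Value'].mean())  # mean after dropping the flagged row
```

Dropping the single value 15 lowers the mean from about 2.67 to about 1.79, which shows how strongly one outlier can pull a summary statistic.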