Anomaly Detection Techniques for Large Datasets
In today's data-driven world, organizations manage huge datasets generated from many sources, such as financial transactions, network activity, social media, and IoT sensors. Analyzing this data provides valuable insights for better decision-making and risk identification. As data volumes grow and systems become more complex, reliable anomaly detection becomes harder. Traditional methods struggle with large datasets, so more advanced techniques are needed that can process data quickly while remaining accurate.

This article discusses various techniques for detecting anomalies in large datasets. The methods are grouped into categories (statistical methods, machine learning-based approaches, deep learning models, time series analysis, and distance-based methods), which makes it easier to pick the right approach based on the data characteristics and business needs.
Table of Contents
- What is Anomaly Detection?
- 1. Z-Score
- 2. Interquartile Range (IQR)
- 3. Logistic Regression
- 4. Support Vector Machine (SVM)
- 5. K-Means Clustering
- 6. Isolation Forest
- 7. DBSCAN
- 8. Autoencoders
- 9. Recurrent Neural Networks (RNNs)
- 10. GANs (Generative Adversarial Networks)
- 11. ARIMA (AutoRegressive Integrated Moving Average)
- 12. LSTM (Long Short-Term Memory Networks)
- 13. k-Nearest Neighbors (k-NN)
What is Anomaly Detection?
Anomaly detection is the process of identifying data points that differ significantly from the normal pattern of a dataset. These points, called anomalies or outliers, can signal important events such as fraud, system failures, network intrusions, or data errors. Detecting them early helps prevent issues and improves decision-making in industries such as finance, healthcare, and cybersecurity.
Detecting anomalies across millions of data points can be quite challenging. Normal patterns are dynamic and may change over time, which makes it hard to pin down what counts as an anomaly, and some anomalies are very rare and deeply embedded in the data. This is why a range of methods is used, from basic statistical techniques to machine learning and deep learning approaches. These techniques help analyze large datasets and surface the points that need attention.
Anomaly Detection Techniques
Statistical Methods
Statistical methods are simple yet effective techniques for detecting anomalies. They are commonly used when the data follows a normal distribution and can be quickly implemented for smaller datasets. Let’s look at two popular statistical methods:
1. Z-Score
The Z-Score method identifies outliers by calculating how far a data point deviates from the mean, measured in standard deviations. Data points with significantly high or low Z-Scores are considered outliers because they sit far from the majority of the data. This method works well for data that follows a normal distribution, where most points cluster near the mean. Tasks such as fraud detection in transactions can benefit from Z-Scores. However, for datasets that do not follow a normal distribution or have complicated structures, other methods such as IQR or machine learning-based approaches may be more effective.
Formula:
Z = \frac{X - \mu}{\sigma}
In the above formula:
- X = data point
- \mu = mean of the dataset
- \sigma = standard deviation of the dataset
Generating a Synthetic Dataset using make_blobs
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5]], cluster_std=0.5, random_state=42)
X = np.vstack([X, [[10, 10], [15, 15], [-5, -5]]])
plt.scatter(X[:, 0], X[:, 1], color='blue', label='Data')
plt.scatter([10, 15, -5], [10, 15, -5], color='red', label='Anomalies', marker='x')
plt.legend()
plt.show()
Output:
[Scatter plot of the two clusters in blue, with the three appended anomaly points marked as red crosses.]
Applying the Z-Score anomaly detection technique:
from scipy import stats
import numpy as np
z_scores = np.abs(stats.zscore(X))
anomalies = np.where(np.any(z_scores > 3, axis=1))
print("Anomalies detected at indices (Z-score):", anomalies[0])
Output:
Anomalies detected at indices (Z-score): [301]
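For reference, the scipy call above is equivalent to applying the Z-Score formula directly; a small sketch with plain NumPy on the same X array, which should flag the same indices:
import numpy as np
# Manual Z-Score: distance from the column mean, measured in standard deviations
z_manual = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
manual_anomalies = np.where(np.any(z_manual > 3, axis=1))[0]
print("Anomalies detected at indices (manual Z-score):", manual_anomalies)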
2. Interquartile Range (IQR)
The Interquartile Range (IQR) method detects anomalies by focusing on the spread of the data between the first quartile (Q1) and the third quartile (Q3), that is, the middle 50% of the data. If a data point lies outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], it is considered a potential anomaly. This method is effective for non-Gaussian data and is more resistant to extreme values than mean- and standard-deviation-based methods. For instance, it can detect abnormally short or long delivery times in a dataset of delivery times. Popular in fields such as logistics, healthcare, and sensor data analysis, the IQR method handles skewed data and datasets with outliers well.
Formula:
IQR = Q_3 - Q_1
Lower bound = Q_1 - 1.5 \times IQR
Upper bound = Q_3 + 1.5 \times IQR
Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
anomalies = np.where((X < lower_bound) | (X > upper_bound))
unique_anomalies = np.unique(anomalies[0])
print("Anomalies detected at indices (IQR):", unique_anomalies)
Output:
Anomalies detected at indices (IQR): [301]
Supervised Machine Learning-Based Techniques
3. Logistic Regression
Logistic Regression is a supervised learning approach for binary classification. In anomaly detection, it is used to estimate the probability that an observation is an outlier. The output is a number between 0 and 1 denoting this probability, and data points whose probability exceeds a threshold (usually 0.5) are marked as anomalies.
It is easy to implement and works well with labeled data. For example, in fraud detection it can classify transactions as normal or fraudulent given past data. However, it can perform poorly on highly imbalanced datasets, so techniques such as class weighting or resampling are often needed; a small sketch follows the example below.
from sklearn.linear_model import LogisticRegression
y = np.array([0] * 300 + [1, 1, 1])
model = LogisticRegression()
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]
threshold = 0.5
anomalies = np.where(probs > threshold)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [301]
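The classes here are highly imbalanced (300 normal points versus 3 anomalies), which is typical for anomaly detection. A minimal sketch of one common mitigation, scikit-learn's class_weight='balanced' option, which re-weights the loss in favour of the rare class (reusing the X and y arrays defined above):
from sklearn.linear_model import LogisticRegression
import numpy as np
# Re-weight the loss so the 3 anomalous samples are not drowned out by the 300 normal ones
balanced_model = LogisticRegression(class_weight='balanced')
balanced_model.fit(X, y)
balanced_probs = balanced_model.predict_proba(X)[:, 1]
balanced_anomalies = np.where(balanced_probs > 0.5)[0]
print("Anomalies detected at indices (balanced):", balanced_anomalies)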
4. Support Vector Machine (SVM)
The Support Vector Machine (SVM) algorithm is a supervised learning method used both for classification and for detecting anomalies. It separates normal data from anomalies by finding the boundary with the largest possible margin between them. With kernel functions, SVM can handle high-dimensional data and capture non-linear patterns.
SVM applications include separating normal traffic patterns from malicious activity. It can be computationally expensive, especially on large datasets; nonetheless, its robustness in complex scenarios makes it a popular choice for anomaly detection. The example below uses One-Class SVM, the unsupervised variant commonly applied to anomaly detection; a supervised variant is sketched at the end of this section.
from sklearn.svm import OneClassSVM
import numpy as np
model = OneClassSVM(nu=0.01, kernel="rbf", gamma=0.01)
model.fit(X)
predictions = model.predict(X)
anomalies = np.where(predictions == -1)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [148 300 301 302]
One-Class SVM flagged additional points (indices 148, 300, 302) that Z-Score and IQR missed. Indices 300 and 302 are two of the appended anomalies, while index 148 is a borderline point near a cluster edge. The method's flexible, non-linear decision boundary makes it more sensitive than the stricter Z-Score and IQR criteria to points that deviate even slightly from the bulk of the data.
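Since this section is about the supervised setting while the example above uses the unsupervised One-Class SVM, here is a minimal supervised sketch using scikit-learn's SVC, assuming the labelled y array from the Logistic Regression example is still available:
from sklearn.svm import SVC
import numpy as np
# Supervised SVM: learn a boundary between labelled normal (0) and anomalous (1) points
svc = SVC(kernel='rbf', gamma='scale', class_weight='balanced')
svc.fit(X, y)
svc_anomalies = np.where(svc.predict(X) == 1)[0]
print("Anomalies detected at indices (supervised SVC):", svc_anomalies)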
Unsupervised Machine Learning-Based Techniques
5. K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm that groups data into clusters based on similarity. For anomaly detection, it measures how close each data point is to its nearest cluster centroid; points that are far from all centroids are flagged as anomalies since they do not fit well into any cluster.
In customer segmentation, for example, customers who do not fit into any defined group can be flagged as anomalous. K-Means is simple and efficient but can be inadequate for datasets whose clusters have different shapes and densities.
from sklearn.cluster import KMeans
import numpy as np
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
distances = np.min(kmeans.transform(X), axis=1)
threshold = np.percentile(distances, 99)
anomalies = np.where(distances > threshold)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [218 300 301 302]
6. Isolation Forest
The Isolation Forest algorithm is a tree-based method developed specifically for anomaly detection. It isolates individual points by recursively partitioning the data with random splits. Because anomalies lie far from the bulk of the data, they are isolated with fewer splits. The algorithm assigns each point an anomaly score; the higher the score, the more likely the point is an anomaly.
This method is highly scalable and works well with large datasets. It is commonly used in fraud detection, network security, and manufacturing to identify rare events or faults. Because it is fast and handles high-dimensional data, it is a preferred choice for anomaly detection.
from sklearn.ensemble import IsolationForest
import numpy as np
model = IsolationForest(contamination=0.01, n_estimators=100, max_samples='auto', random_state=42)
model.fit(X)
anomalies = model.predict(X)
anomalies = np.where(anomalies == -1)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [183 300 301 302]
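The per-point anomaly scores mentioned above can also be inspected directly. A small sketch that reuses the fitted model and flags the lowest-scoring 1% of points (in scikit-learn's convention, lower decision_function values mean more anomalous):
import numpy as np
# Lower decision_function values indicate more anomalous points
scores = model.decision_function(X)
score_threshold = np.percentile(scores, 1)
score_anomalies = np.where(scores <= score_threshold)[0]
print("Anomalies detected at indices (score-based):", score_anomalies)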
7. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It forms clusters in high-density regions and labels points in low-density regions as noise, i.e., anomalies. Unlike K-Means, DBSCAN can identify clusters of different shapes and sizes.
For instance, in sensor data, DBSCAN can flag sudden deviations that fall outside the dense regions of the data. It is robust to noise and does not require the number of clusters to be known beforehand, which makes it suitable for complex datasets. However, choosing good values for its parameters, the minimum number of points (min_samples) and the distance threshold (eps), can be tricky; a common heuristic is sketched after the example below.
from sklearn.cluster import DBSCAN
import numpy as np
dbscan = DBSCAN(eps=1.0, min_samples=5)
predictions = dbscan.fit_predict(X)
anomalies = np.where(predictions == -1)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [300 301 302]
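Choosing eps is usually the hardest part of applying DBSCAN. A common heuristic, sketched below, is to plot the sorted distance from each point to its min_samples-th nearest neighbour and pick eps near the "elbow" of the curve:
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import numpy as np
# k-distance plot: sorted distance to the k-th nearest neighbour (k = min_samples)
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)
k_distances = np.sort(nn.kneighbors(X)[0][:, -1])
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to 5th nearest neighbour")
plt.title("k-distance plot for choosing eps")
plt.show()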
Deep Learning Techniques
8. Autoencoders
An autoencoder is a neural network that learns to compress data into a lower-dimensional representation and then reconstruct it from that representation. During training, the model minimizes the reconstruction error on normal data. In anomaly detection, points with a high reconstruction error, that is, points that do not fit the learned patterns, are flagged as anomalies.
Autoencoders are used to detect anomalies in high-dimensional data such as network traffic and images. Because they work best when normal data is well represented in the training set, they may need careful tuning to avoid overfitting; one common refinement is sketched after the example below.
from keras.models import Model
from keras.layers import Input, Dense
import numpy as np
input_dim = X.shape[1]
input_layer = Input(shape=(input_dim,))
encoded = Dense(32, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(X, X, epochs=50, batch_size=256, shuffle=True)
reconstructed = autoencoder.predict(X)
reconstruction_error = np.mean(np.abs(X - reconstructed), axis=1)
threshold = np.percentile(reconstruction_error, 99)
anomalies = np.where(reconstruction_error > threshold)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [206 300 301 302]
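The autoencoder above was trained on the full dataset, anomalies included. A common refinement, sketched here under the assumption that the first 300 rows are the normal points (which holds for this synthetic dataset by construction), is to train only on presumed-normal data and then score every point:
# Continue training the autoencoder defined above on presumed-normal points only
X_normal = X[:300]
autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=256, shuffle=True, verbose=0)
reconstructed_all = autoencoder.predict(X)
errors = np.mean(np.abs(X - reconstructed_all), axis=1)
refit_anomalies = np.where(errors > np.percentile(errors, 99))[0]
print("Anomalies detected at indices (normal-only training):", refit_anomalies)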
9. Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are neural networks designed for sequential data and are well suited to time series anomaly detection. They are typically trained to predict the next values in a sequence; anomalies are detected by comparing the predictions with the actual values and flagging points where the prediction error exceeds a threshold.
For example, RNNs are used in anomaly detection for stock prices or sensor data in industrial systems. Long Short-Term Memory (LSTM), a type of RNN, is particularly useful for handling long sequences and reducing errors in prediction-based anomaly detection.
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
X_seq = X.reshape((X.shape[0], 1, X.shape[1])) # 3D input (samples, time_steps, features)
model = Sequential()
model.add(LSTM(units=50, return_sequences=False, input_shape=(X_seq.shape[1], X_seq.shape[2])))
model.add(Dense(X.shape[1])) # Output layer to match the input dimensions of X
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_seq, X, epochs=50, batch_size=32)
predictions = model.predict(X_seq)
prediction_error = np.mean(np.abs(X - predictions), axis=1)
threshold = np.percentile(prediction_error, 99)
anomalies = np.where(prediction_error > threshold)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [278 300 301 302]
10. GANs (Generative Adversarial Networks)
Generative Adversarial Networks (GANs) consist of two neural networks, a generator and a discriminator, trained against each other. The generator tries to create data similar to the real dataset, while the discriminator learns to distinguish real data from generated data. For anomaly detection, points that the trained discriminator scores as unlikely to be real are flagged as anomalies.
GANs are effective when modeling complex data distributions like images and audio. They have applications in areas like fraud detection and video surveillance which are beyond the capabilities of traditional models. However, it is difficult to train GANs because of problems like instability and mode collapse.
from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras.optimizers import Adam
import numpy as np

def build_generator():
    model = Sequential()
    model.add(Dense(128, input_dim=X.shape[1], activation='relu'))
    # Linear output so generated points can cover the same range as the (unscaled) data
    model.add(Dense(X.shape[1], activation='linear'))
    return model

def build_discriminator():
    model = Sequential()
    model.add(Dense(128, input_dim=X.shape[1], activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model

generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy')

# Combined model: the discriminator is frozen while the generator is trained through it
discriminator.trainable = False
input_layer = Input(shape=(X.shape[1],))
generated_data = generator(input_layer)
discriminator_output = discriminator(generated_data)
gan = Model(input_layer, discriminator_output)
gan.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy')

for epoch in range(1000):
    # Train the discriminator on a batch of real points (label 1) and generated points (label 0)
    noise = np.random.randn(32, X.shape[1])
    fake_data = generator.predict(noise, verbose=0)
    real_data = X[np.random.randint(0, X.shape[0], 32)]
    discriminator.train_on_batch(real_data, np.ones(32))
    discriminator.train_on_batch(fake_data, np.zeros(32))
    # Train the generator (through the combined model) to fool the discriminator
    gan.train_on_batch(np.random.randn(32, X.shape[1]), np.ones(32))

# Points the trained discriminator considers least likely to be real are flagged as anomalies
discriminator_scores = discriminator.predict(X, verbose=0)
threshold = np.percentile(discriminator_scores, 0.5)
anomalies = np.where(discriminator_scores < threshold)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [315 667 766 816 896]
With only a few hundred points, the GAN has too few examples from which to learn a reliable model of the normal data, and GAN training is itself prone to instability, so the indices it flags here should not be taken at face value.
Time-Series Anomaly Detection (for sequential data)
11. ARIMA (AutoRegressive Integrated Moving Average)
ARIMA is one of the most widely used statistical models for analyzing and forecasting time series data. It models the trend and seasonality in the data and then compares its forecasts against the observed values. Points where the difference between the predicted and actual value is unusually large are labelled as anomalies.
It is commonly used in financial data, sales forecasting, and network monitoring. ARIMA works well for data with clear patterns but may require manual tuning of the (p, d, q) order to achieve accurate results; a simple order-selection sketch follows the detection example below.
Generating a random dataset to demonstrate anomaly detection using ARIMA:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
import pandas as pd
np.random.seed(42)
t = np.linspace(0, 365, 365)
seasonal_pattern = 10 * np.sin(2 * np.pi * t / 365)
trend = 0.01 * t
noise = np.random.normal(0, 0.3, 365)
data = seasonal_pattern + trend + noise
anomaly_indices = [50, 150, 250]
anomaly_values = [8, -7, 9]
for idx, val in zip(anomaly_indices, anomaly_values):
    data[idx] += val
df = pd.DataFrame({
    'value': data,
    'is_anomaly': [1 if i in anomaly_indices else 0 for i in range(len(data))]
})
plt.figure(figsize=(15, 5))
plt.plot(df.index, df['value'], label='Time Series')
plt.scatter(df[df['is_anomaly'] == 1].index,
            df[df['is_anomaly'] == 1]['value'],
            color='red', label='True Anomalies', s=100)
plt.title('Synthetic Time Series with Anomalies')
plt.legend()
plt.show()
Output:
[Line plot of the synthetic time series with the three injected anomalies highlighted as red points.]
Anomaly Detection:
model = ARIMA(data, order=(5, 1, 2))
model_fit = model.fit()
predictions = model_fit.predict(start=0, end=len(data)-1)
prediction_error = np.abs(data - predictions)
threshold = np.percentile(prediction_error, 99)
min_error = 2.0
detected_anomalies = np.where((prediction_error > threshold) &
                              (prediction_error > min_error))[0]
print(f"True anomaly indices: {anomaly_indices}")
print(f"Detected anomaly indices: {sorted(detected_anomalies)}")
Output:
True anomaly indices: [50, 150, 250]
Detected anomaly indices: [np.int64(50), np.int64(150), np.int64(250)]
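The (p, d, q) order above was chosen by hand. A rough but common way to automate this, sketched below, is a small grid search that keeps the order with the lowest AIC (reusing the data array from above):
import itertools
from statsmodels.tsa.arima.model import ARIMA
# Try a few candidate orders and keep the one with the lowest AIC
best_order, best_aic = None, float("inf")
for p, d, q in itertools.product(range(4), range(2), range(3)):
    try:
        aic = ARIMA(data, order=(p, d, q)).fit().aic
    except Exception:
        continue
    if aic < best_aic:
        best_order, best_aic = (p, d, q), aic
print("Best order by AIC:", best_order, "AIC:", best_aic)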
12. LSTM (Long Short-Term Memory Networks)
LSTM is a kind of RNN which is specifically designed to manage long-term dependencies in sequential data. In time series anomaly detection, LSTM is used to generate the next values of a sequence and then check them against actual values. High prediction errors can be considered as potential anomalies.
LSTMs are widely used in applications such as sensor monitoring, predictive maintenance, and stock price analysis. Thanks to their capacity to learn long-range dependencies, they often outperform traditional models on time series data.
A basic LSTM was already used in the RNN example above; a sliding-window forecasting sketch on the synthetic series from the ARIMA section follows.
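This is a minimal sketch, assuming the data array from the ARIMA section is still available: the LSTM predicts each value from the previous 10, and points with unusually large prediction errors are flagged.
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
window = 10
# Build (samples, time_steps, features) windows and their next-step targets
X_win = np.array([data[i:i + window] for i in range(len(data) - window)]).reshape(-1, window, 1)
y_next = data[window:]
lstm = Sequential()
lstm.add(LSTM(50, input_shape=(window, 1)))
lstm.add(Dense(1))
lstm.compile(optimizer='adam', loss='mean_squared_error')
lstm.fit(X_win, y_next, epochs=20, batch_size=32, verbose=0)
errors = np.abs(y_next - lstm.predict(X_win, verbose=0).flatten())
lstm_anomalies = np.where(errors > np.percentile(errors, 99))[0] + window  # shift back to original indices
print("Anomalies detected at indices (LSTM forecast):", lstm_anomalies)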
Distance-Based Techniques
13. k-Nearest Neighbors (k-NN)
The k-NN approach detects anomalies based on the distance of each data point to its k nearest neighbors. If this distance is significantly larger than it is for most other points, the point is labelled an anomaly.
This method is simple and quite effective for low-dimensional data. It is often applied in fraud detection, network intrusion detection, and credit card transaction monitoring. However, it can become computationally expensive on large datasets.
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5]], cluster_std=0.5, random_state=42)
X = np.vstack([X, [[10, 10], [15, 15], [-5, -5]]])
knn = NearestNeighbors(n_neighbors=5)
knn.fit(X)
distances, _ = knn.kneighbors(X)
distance_to_5th_nearest_neighbor = distances[:, -1]
threshold = np.percentile(distance_to_5th_nearest_neighbor, 99)
anomalies = np.where(distance_to_5th_nearest_neighbor > threshold)[0]
print("Anomalies detected at indices:", anomalies)
Output:
Anomalies detected at indices: [218 300 301 302]
Conclusion
In conclusion, anomaly detection is a crucial tool across various industries, enabling the identification of outliers or unusual patterns that could indicate important issues like fraud, system failures, or rare occurrences. By utilizing a combination of statistical methods, machine learning algorithms, and distance-based techniques, organizations can effectively detect anomalies and gain deeper insights into their data. The right approach depends on the nature of the dataset and the problem at hand, but the ability to recognize and address these anomalies is essential for improving decision-making and enhancing overall system performance and efficiency.