
How to check the accuracy of your Machine Learning model?

Last Updated : 09 Jan, 2025

Accuracy evaluates how well a machine learning model performs. It represents the percentage of correct predictions made by the model. While simple to calculate and understand, accuracy is most effective when the dataset is balanced.

In this article, we will learn how to measure a model's accuracy and look at other evaluation metrics.

How to Measure Accuracy?

Accuracy is easy to understand and works well when the data is balanced, i.e., all classes in the dataset are equally represented. However, accuracy isn't always the best measure of performance. In a dataset where one class dominates, say 95% of cases being "negative" and only 5% "positive", a model that predicts only the majority class would have high accuracy but wouldn't solve the problem effectively.

To calculate accuracy, we compare the model's predictions to the actual values: count how many predictions were correct and divide that by the total number of predictions. We can break the predictions down like this:

  • True Positives (TP): Correctly predicted as positive.
  • True Negatives (TN): Correctly predicted as negative.
  • False Positives (FP): Predicted positive but actually negative.
  • False Negatives (FN): Predicted negative but actually positive.

The full formula is:

\text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}

For example, if a model is tested on 100 cases and gets 85 correct, the accuracy is:

\text{Accuracy} = \frac{85}{100} = 85\%

You can also calculate accuracy easily with Python using libraries like scikit-learn:

Python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0]  # model predictions (one of five is wrong)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy Score:", accuracy)

Output:

Accuracy Score: 0.8

Accuracy Paradox

The accuracy paradox happens when a machine learning model seems to perform well because it has high accuracy but it's not actually making useful predictions. This problem is common in datasets where one class appears much more frequently than the others, i.e., imbalanced datasets. In these cases the model may predict the majority class correctly but fail on the minority class, which is often the more important one.

Example: Imagine a health model predicting whether a patient has cancer:

  • 95% of patients are healthy
  • 5% of patients have cancer

A model that always predicts "healthy" would have 95% accuracy because it correctly identifies the majority class, i.e., healthy patients. However, it would miss all cancer cases, which is a serious problem. This high accuracy score gives a false sense of success and hides the model's failure to predict the minority class.
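
To see the paradox in numbers, here is a minimal sketch with scikit-learn. The data is made up for illustration: 100 hypothetical patients, of whom 5 have cancer, and a naive model that always predicts "healthy".

Python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 0 = healthy, 1 = cancer (95 healthy, 5 cancer).
y_true = [0] * 95 + [1] * 5

# A naive model that always predicts the majority class ("healthy").
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print("Recall:", recall_score(y_true, y_pred))      # 0.0  -- misses every cancer case

Despite 95% accuracy, the recall of 0 shows the model never detects the class we actually care about.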

How to Solve the Accuracy Paradox?

Instead of relying on accuracy alone, we use other metrics to evaluate the model more effectively (a short scikit-learn sketch follows the list):

  • Precision: Focuses on the quality of positive predictions.
  • Recall: Focuses on capturing all actual positive cases.
  • F1-Score: Combines precision and recall to give a balanced measure.
  • Confusion Matrix: A table showing correct and incorrect predictions for each class. It gives a detailed view of the model’s performance.
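
As a quick illustration, the sketch below (using the same hypothetical 100-patient example as above) shows how a confusion matrix and a per-class report expose the failure that a single accuracy number hides:

Python
from sklearn.metrics import confusion_matrix, classification_report

# Same hypothetical imbalanced data: 0 = healthy, 1 = cancer.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predicts "healthy"

print(confusion_matrix(y_true, y_pred))  # [[95  0]
                                         #  [ 5  0]]
print(classification_report(y_true, y_pred, zero_division=0))

The second row of the matrix shows all 5 positive cases being misclassified, and the report shows precision, recall and F1 of 0 for the minority class.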

Measuring Accuracy in Different Classification Scenarios

1. Binary Classification

In binary classification, the model predicts one of two possible classes, such as "spam" or "not spam." Accuracy is calculated as:

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP + TN + FP + FN}}

Example: Consider a spam filter: high accuracy suggests the filter performs well, but if the dataset is imbalanced, accuracy might not reveal the true performance. Precision and recall are often used alongside accuracy in such cases.

2. Multiclass Classification

In multiclass classification, the model predicts one of three or more classes, such as classifying handwritten digits (0–9). Accuracy in this context is defined as:

\text{Multiclass Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i = z_i]

where

  • n is the number of samples.
  • y_i is the true label and z_i is the predicted label of the i-th sample.
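
A small sketch of this formula with NumPy (the labels below are made up for illustration):

Python
import numpy as np

# Hypothetical true and predicted digit labels for six samples.
y_true = np.array([0, 3, 3, 7, 7, 9])
y_pred = np.array([0, 3, 5, 7, 9, 9])

# Multiclass accuracy = average of the indicator 1[y_i == z_i].
accuracy = np.mean(y_true == y_pred)
print("Multiclass accuracy:", accuracy)  # 4 correct out of 6 -> about 0.667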

Example: Confusion matrix for a multiclass classification problem:


|          | Predicted A | Predicted B | Predicted C | Total |
|----------|-------------|-------------|-------------|-------|
| Actual A | 7           | 2           | 1           | 10    |
| Actual B | 3           | 5           | 2           | 10    |
| Actual C | 2           | 1           | 3           | 6     |
| Total    | 12          | 8           | 6           | 26    |

Correct predictions: 7 + 5 + 3 = 15
Total predictions: 10 + 10 + 6 = 26

Accuracy:

\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} = \frac{15}{26} \approx 0.577

Result: The model achieved roughly 57.7% accuracy.
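
The same number can be computed directly from the confusion matrix, for example with NumPy: the diagonal holds the correct predictions and the overall sum is the total number of samples.

Python
import numpy as np

# Confusion matrix from the table above (rows = actual, columns = predicted).
cm = np.array([[7, 2, 1],
               [3, 5, 2],
               [2, 1, 3]])

correct = np.trace(cm)  # 7 + 5 + 3 = 15
total = cm.sum()        # 26
print("Accuracy:", correct / total)  # about 0.577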

3. Multilabel Classification

In multilabel classification, a single instance can be associated with multiple classes at the same time. For instance, a news article might be labeled with categories like "Politics," "Economy," and "World."

Multilabel accuracy, also referred to as the Hamming score, is calculated by comparing, for each sample, the labels predicted correctly (the intersection of the true and predicted label sets) to all labels involved (their union), and averaging this ratio over all samples.

\text{Accuracy}, A = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}

  • n refers to the number of samples.
  • Y_i is the set of true labels and Z_i is the set of predicted labels for the i-th sample.
Figure: Multiclass Classification vs Multilabel Classification
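
A minimal sketch of the Hamming score using plain Python sets; the label sets below are made up for illustration, and the edge case of an empty union is not handled:

Python
# Hypothetical true and predicted label sets for three articles.
y_true = [{"Politics", "Economy"}, {"World"}, {"Politics", "World"}]
y_pred = [{"Politics"}, {"World", "Economy"}, {"Politics", "World"}]

# Average the per-sample ratio |intersection| / |union|.
score = sum(len(t & p) / len(t | p) for t, p in zip(y_true, y_pred)) / len(y_true)
print("Multilabel accuracy (Hamming score):", round(score, 3))  # (0.5 + 0.5 + 1.0) / 3 ≈ 0.667

If the labels are encoded as a binary indicator matrix, scikit-learn's jaccard_score with average='samples' computes the same per-sample ratio.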

Additional Metrics Used to Evaluate a Machine Learning Model

1. Precision

Precision emphasizes the accuracy of positive predictions by calculating the ratio of correctly predicted positive instances to the total instances predicted as positive.

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

  • Use Case: Precision is vital in scenarios where false positives are costly such as email spam filters or fraud detection systems. A high precision ensures fewer incorrect positive predictions.

2. Recall (Sensitivity or True Positive Rate)

Recall measures the proportion of actual positive instances that the model correctly identified. It focuses on capturing as many true positives as possible.

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

  • Use Case: Recall is critical in applications where missing a positive case has severe consequences, such as disease diagnosis or safety-critical systems. A high recall ensures most positive cases are detected.

3. F1-Score

The F1-Score combines precision and recall into a single metric by calculating their harmonic mean. It provides a balance between the two and is particularly useful when you need to account for both false positives and false negatives.

\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

  • Use Case: The F1-score is ideal for imbalanced datasets, where focusing on just precision or recall might be misleading. It ensures the model maintains a balance between detecting positive cases and avoiding false alarms.
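
The sketch below computes all three metrics with scikit-learn on a small set of made-up binary predictions:

Python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1-Score:", f1_score(y_true, y_pred))          # harmonic mean of the two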

Accuracy is a good metric to use when you want to see how well a model is classifying data, especially if the dataset is balanced, meaning all classes have similar amounts of data. It gives a simple measure of how many predictions are correct out of all predictions.

