Removing Duplicate Data from a Dataset Using Python
Duplicates are a common issue in real-world datasets and can negatively impact our analysis. They occur when identical rows or entries appear multiple times in a dataset. Although they may seem harmless, they can distort results if left unaddressed. Duplicates typically arise from:
- Data entry errors: When the same information is recorded more than once by mistake.
- Merging datasets: Combining data from different sources can produce overlapping records, which creates duplicates.
Why Do Duplicates Cause Problems?
- Skewed Analysis: Duplicates can distort results and lead to misleading conclusions, such as an inflated average salary (see the sketch after this list).
- Inaccurate Models: Duplicates can cause machine learning models to overfit, reducing their ability to perform well on new data.
- Increased Computational Costs: Redundant rows consume extra computational power, slowing down analysis and the overall workflow.
- Data Redundancy and Complexity: Duplicates make it harder to maintain accurate records and organize data, adding unnecessary complexity.
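To make the skewed-analysis point concrete, here is a minimal sketch (the salary figures are made up for illustration) showing how one accidentally repeated record inflates an average:
import pandas as pd

# Hypothetical salaries: Bob's record was entered twice by mistake
salaries = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob'],
    'Salary': [50000, 90000, 90000]
})

print(salaries['Salary'].mean())                    # ~76666.67 -- skewed by the duplicate
print(salaries.drop_duplicates()['Salary'].mean())  # 70000.0   -- after removing it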
Identifying Duplicates
The first step in managing duplicates is identifying them in the dataset. Pandas provides several functions for spotting and removing duplicate rows. Below we will see how to identify and remove duplicates using Python, working with the Pandas library and the sample dataset defined here.
import pandas as pd

# Sample dataset with intentional duplicates (Alice and Bob each appear twice)
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'David'],
    'Age': [25, 30, 25, 35, 30, 40],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'San Francisco']
}
df = pd.DataFrame(data)
print(df)
Output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2    Alice   25       New York
3  Charlie   35        Chicago
4      Bob   30    Los Angeles
5    David   40  San Francisco
1. Using the duplicated() Method
The duplicated() method identifies duplicate rows in a dataset. It returns a boolean Series indicating whether each row is a duplicate of an earlier row.
# Flag rows that are exact repeats of an earlier row
duplicates = df.duplicated()
print(duplicates)
Output:

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
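Two handy follow-ups, sketched with the same df, are counting the duplicate rows and pulling them out with boolean indexing:
# Count duplicate rows and inspect them
print(df.duplicated().sum())   # 2
print(df[df.duplicated()])     # the repeated Alice and Bob rows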
2. Using the drop_duplicates() Method
The drop_duplicates() method removes duplicate rows from a DataFrame. By default it compares all columns; it can also be restricted to specific columns if required.
# Drop rows that duplicate an earlier row across all columns
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
3  Charlie   35        Chicago
5    David   40  San Francisco
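Note that drop_duplicates() keeps the original index labels (0, 1, 3, 5 above). If consecutive row numbers are preferred, a common follow-up is reset_index, sketched here:
# Renumber rows after dropping duplicates
df_no_duplicates = df.drop_duplicates().reset_index(drop=True)
print(df_no_duplicates)   # index now runs 0, 1, 2, 3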
Removing Duplicates
Duplicates may appear in only some columns rather than across the entire row. In such cases, we can restrict the duplicate check to specific columns.
1. Based on Specific Columns
Here we pass the Name and City columns to drop_duplicates(), so rows are treated as duplicates whenever those two columns match.
# Treat rows as duplicates when Name and City match, ignoring Age
df_no_duplicates_columns = df.drop_duplicates(subset=['Name', 'City'])
print(df_no_duplicates_columns)
Output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
3  Charlie   35        Chicago
5    David   40  San Francisco
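In this sample the result matches the full-row check because the repeated rows agree in every column, including Age. A small sketch with a modified copy (the changed Age value is hypothetical) shows where the subset matters:
# Hypothetical variant: the second Alice row now has a different Age
df_variant = df.copy()
df_variant.loc[2, 'Age'] = 26

print(df_variant.drop_duplicates())                         # keeps both Alice rows
print(df_variant.drop_duplicates(subset=['Name', 'City']))  # drops the second one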
2. Keeping the First or Last Occurrence
By default, drop_duplicates() keeps the first occurrence of each duplicated row. Passing keep='last' keeps the last occurrence instead.
# Keep the last occurrence of each duplicate instead of the first
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)
Output:

      Name  Age           City
2    Alice   25       New York
3  Charlie   35        Chicago
4      Bob   30    Los Angeles
5    David   40  San Francisco
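keep also accepts False, which drops every occurrence of a duplicated row instead of retaining one copy; a quick sketch:
# Keep no copy at all of any duplicated row
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)   # only Charlie and David remain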
Cleaning duplicates is an important step in ensuring data accuracy, improving model performance, and keeping analysis efficient.