Delete Duplicate Rows in MS SQL Server
In MS SQL Server, duplicate rows are a common problem that can affect the integrity and performance of a database. SQL Server provides several ways to identify and delete them.
In this article, we will explore three approaches: the GROUP BY and HAVING clauses, Common Table Expressions (CTE), and the RANK() function.
How to Delete Duplicate Rows in SQL Server?
A common way to delete duplicate rows in MS SQL Server is to number the rows within each group of duplicates using a CTE, then delete every row after the first. The general syntax is:
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY [column1], [column2], ...
ORDER BY [column1], [column2], ...) AS RowNumber
FROM [table_name]
)
DELETE FROM CTE
WHERE RowNumber > 1;
Different Ways To Find And Delete Duplicate Rows From A Table In SQL Server
Let us look at an example, and learn how to delete duplicate rows in MS SQL Server. First, let's create a table and insert some duplicate rows in the table. Let us create a table named Geek:
CREATE TABLE Geek(
Name NVARCHAR(100) NOT NULL,
Email NVARCHAR(255) NOT NULL,
City NVARCHAR(100) NOT NULL
);
INSERT INTO Geek (Name, Email, City) VALUES
('Nisha', 'nisha@gfg.com', 'Delhi'),
('Megha', 'megha@gfg.com', 'Noida'),
('Khushi', 'khushi@gfg.com', 'Jaipur'),
('Khushi', 'khushi@gfg.com', 'Jaipur'),
('Khushi', 'khushi@gfg.com', 'Jaipur'),
('Hina', 'hina@gfg.com', 'Kanpur'),
('Hina', 'hina@gfg.com', 'Kanpur'),
('Misha', 'misha@gfg.com', 'Gurugram'),
('Misha', 'misha@gfg.com', 'Gurugram'),
('Neha', 'neha@gfg.com', 'Pilani');
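As a quick sanity check before removing anything, the following sketch compares the total row count with the number of distinct rows. It assumes the Geek table and sample data above; the aliases TotalRows, DistinctRows, and d are illustrative names only.

```sql
-- Compare the total row count with the number of distinct
-- (Name, Email, City) combinations in Geek.
SELECT
    (SELECT COUNT(*) FROM Geek) AS TotalRows,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT Name, Email, City FROM Geek) AS d) AS DistinctRows;
```

With the sample data this should report 10 total rows and 6 distinct combinations, meaning 4 rows are redundant.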
1. Using GROUP BY and HAVING Clause
Problem Statement:
Identify the duplicate rows in the Geek table, that is, every combination of Name, Email, and City that appears more than once, so that only one entry per combination needs to be kept.
SQL Query:
-- List each Name, Email, City combination that occurs more than once
SELECT Name, Email, City
FROM Geek
GROUP BY Name, Email, City
HAVING COUNT(*) > 1;
Output:
Name | Email | City |
---|---|---|
Khushi | khushi@gfg.com | Jaipur |
Hina | hina@gfg.com | Kanpur |
Misha | misha@gfg.com | Gurugram |
Explanation:
- The GROUP BY clause groups rows by Name, Email, and City.
- The HAVING COUNT(*) > 1 clause keeps only the groups that contain more than one row, which indicates duplicates.
- This query only identifies the duplicated combinations; it does not delete any rows.
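The query above only lists the duplicated combinations. One way to turn the same grouping into an actual clean-up is to copy one row per group into a temporary table and reload the original table from it. This is a minimal sketch, assuming nothing else depends on Geek (no foreign keys or triggers) and that the table is small enough to rewrite in full; the temporary table name #GeekDistinct is hypothetical.

```sql
BEGIN TRANSACTION;

-- Keep exactly one row for every (Name, Email, City) group.
SELECT Name, Email, City
INTO #GeekDistinct
FROM Geek
GROUP BY Name, Email, City;

-- Empty the original table and reload the de-duplicated rows.
DELETE FROM Geek;

INSERT INTO Geek (Name, Email, City)
SELECT Name, Email, City
FROM #GeekDistinct;

DROP TABLE #GeekDistinct;

-- Replace COMMIT with ROLLBACK to try this out without keeping the changes.
COMMIT TRANSACTION;
```

This rewrites the whole table, so it is an alternative to the in-place deletes shown in the following sections rather than a step to run before them.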
2. Using Common Table Expressions (CTE)
Problem Statement:
Remove duplicate rows from the Geek table while retaining only the first occurrence of each unique combination of Name, Email, and City.
SQL Query:
WITH CTE AS (
SELECT
Name,
Email,
City,
ROW_NUMBER() OVER (PARTITION BY Name, Email, City ORDER BY (SELECT NULL)) AS rn
FROM Geek
)
-- Delete duplicates where the row number is greater than 1
DELETE FROM CTE
WHERE rn > 1;
This query does not produce a direct output but removes the duplicate entries.
Explanation:
- The ROW_NUMBER() function assigns a unique sequential integer to rows within a partition of Name, Email, and City.
- PARTITION BY specifies the columns used to identify duplicates.
- ORDER BY (SELECT NULL) allows row numbers to be assigned arbitrarily.
- The DELETE statement removes rows where rn > 1, keeping only the first occurrence.
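Because a committed DELETE cannot be undone, it may be worth running it inside a transaction and checking the result before committing. The following is a minimal sketch, assuming the Geek table and sample data above; @@ROWCOUNT is a built-in function that returns the number of rows affected by the previous statement, and the aliases RowsDeleted and Occurrences are illustrative.

```sql
BEGIN TRANSACTION;

WITH CTE AS (
    SELECT
        Name,
        Email,
        City,
        ROW_NUMBER() OVER (PARTITION BY Name, Email, City ORDER BY (SELECT NULL)) AS rn
    FROM Geek
)
DELETE FROM CTE
WHERE rn > 1;

-- Number of rows removed by the DELETE above.
SELECT @@ROWCOUNT AS RowsDeleted;

-- Should return no rows if every duplicate is gone.
SELECT Name, Email, City, COUNT(*) AS Occurrences
FROM Geek
GROUP BY Name, Email, City
HAVING COUNT(*) > 1;

-- Use ROLLBACK TRANSACTION here instead to undo the delete.
COMMIT TRANSACTION;
```

With the unmodified sample data, RowsDeleted should be 4 (two extra Khushi rows, one extra Hina row, and one extra Misha row).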
3. Using RANK() Function
Problem Statement:
Remove duplicate rows from the Geek table while retaining only the top-ranked row (rank 1) for each unique combination of Name, Email, and City.
SQL Query:
-- Assumption: Geek has no unique column, so add a hypothetical Id column to break ties.
-- RANK() gives tied rows the same rank, so ORDER BY (SELECT NULL) would rank every row 1.
ALTER TABLE Geek ADD Id INT IDENTITY(1,1) NOT NULL;
GO -- batch separator (SSMS / sqlcmd) so the new column is visible to the next statement
WITH RankedCTE AS (
SELECT
Name,
Email,
City,
RANK() OVER (PARTITION BY Name, Email, City ORDER BY Id) AS rnk
FROM Geek
)
-- Delete duplicates where the rank is greater than 1
DELETE FROM RankedCTE
WHERE rnk > 1;
Output:
This query does not produce a direct output but removes the duplicate entries.
Explanation:
- The RANK() function assigns a rank to each row within a partition of Name, Email, and City, based on the ORDER BY expression.
- Rows that tie on the ORDER BY expression receive the same rank, so the ordering must use a column that is unique within each partition. The Geek table has no such column, which is why the Id identity column is added here as an assumption.
- The DELETE statement removes rows where rnk > 1 through the CTE, so only the top-ranked row (the one with the lowest Id) is kept in each group.
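To see how ties affect the two functions, the sketch below shows ROW_NUMBER() and RANK() side by side on the Geek table with a constant ordering. It assumes the sample data still contains duplicates; the aliases rn and rnk are illustrative.

```sql
-- Compare ROW_NUMBER() and RANK() when every row in a partition is tied.
SELECT
    Name,
    Email,
    City,
    ROW_NUMBER() OVER (PARTITION BY Name, Email, City ORDER BY (SELECT NULL)) AS rn,
    RANK()       OVER (PARTITION BY Name, Email, City ORDER BY (SELECT NULL)) AS rnk
FROM Geek
ORDER BY Name, rn;
```

With a constant ordering every row in a partition is tied, so RANK() returns 1 for all of them while ROW_NUMBER() still counts 1, 2, 3. A filter such as rnk > 1 therefore only finds duplicates when the ORDER BY column is unique within each partition.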
Conclusion
Deleting duplicate rows in SQL Server is crucial for maintaining clean and accurate datasets. The methods outlined here, GROUP BY with HAVING, Common Table Expressions (CTE), and the RANK() function, offer versatile solutions for different scenarios. Whether you need to identify duplicates, remove them while retaining the first occurrence, or prioritize rows based on ranking, SQL Server provides robust tools to achieve these goals.