DeDuping a Large Table - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2012 Forums

Transact-SQL (2012)

DeDuping a Large Table

Author

Topic

sashak87
Starting Member

1 Post

Posted - 2014-03-13 : 10:00:48

Hi all,

I'm new to SQL so please go easy.

I've got a table which has ten columns and roughly 120m rows which I need to dedupe. There is no column that is unique to each row and any dupes will be duplicated entire rows. There will be many instances of rows where 8 or 9 values are the same but one or two are different.

What is the quickest way to dedupe the above in as few steps as possible?

Off the top of my head the only way I can think is to combine the ten columns into one long string, count occurrences of each string and then somehow dedupe off that? Is there a more efficient way?

Thanks!

MIK_2008
Master Smack Fu Yak Hacker

1054 Posts

Posted - 2014-03-13 : 10:30:20

may be checkSum can help you in that case. But you would like to read and understand the BOL for this function in order to understand and see if it can fully fit in your scenario (based on the data set you have)

e.g. Lets say there are four columns in the table and has no uniqueness is not imposed. In order to check the duplicate rows
SELECT *
FROM TableName
WHERE CheckSum(Col1,Col2,Col3,Col4) IN ( --the below inner query will result the duplicated records based on all columns
SELECT CheckSum(Col1,Col2,Col3,Col4),count(1)
FROM TableName
Group by CheckSum(Col1,Col2,Col3,Col4) having count(1) >1
)

Cheers
MIK

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2014-03-14 : 14:06:48

What are the rules to de-dup?

If it is just the entire row is the same, then I'd suggest just using the ROW_NUMBER() function to partitoion by each column and DELETE where the "row num" is greater than 1.

cake.

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2014-03-14 : 14:16:09

Here is a quick sample:

DECLARE @Foo TABLE (Val1 INT, Val2 INT, Val3 INT)

INSERT @Foo VALUES

(1, NULL, 1),
(1, NULL, 1),
(1, NULL, 1),
(1, NULL, 1),
(1, 1, 1),
(1, 2, 2),
(1, 2, 2),
(1, 2, 3),
(1, 2, 3),
(1, 2, 2),
(2, 2, 2),
(2, 2, 2)

DELETE 
	T
FROM
	(
		SELECT
			ROW_NUMBER() OVER (PARTITION BY Val1, Val2, Val3 ORDER BY Val1) RowNum
		FROM
			@Foo
	) AS T
WHERE
	RowNum > 1

SELECT * FROM @Foo

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources