Site Sponsored By: SQLDSC - SQL Server Desired State Configuration
Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Hi all,I'm new to SQL so please go easy.I've got a table which has ten columns and roughly 120m rows which I need to dedupe. There is no column that is unique to each row and any dupes will be duplicated entire rows. There will be many instances of rows where 8 or 9 values are the same but one or two are different.What is the quickest way to dedupe the above in as few steps as possible?Off the top of my head the only way I can think is to combine the ten columns into one long string, count occurrences of each string and then somehow dedupe off that? Is there a more efficient way?Thanks!
MIK_2008
Master Smack Fu Yak Hacker
1054 Posts
Posted - 2014-03-13 : 10:30:20
may be checkSum can help you in that case. But you would like to read and understand the BOL for this function in order to understand and see if it can fully fit in your scenario (based on the data set you have) e.g. Lets say there are four columns in the table and has no uniqueness is not imposed. In order to check the duplicate rows SELECT * FROM TableName WHERE CheckSum(Col1,Col2,Col3,Col4) IN ( --the below inner query will result the duplicated records based on all columns SELECT CheckSum(Col1,Col2,Col3,Col4),count(1)FROM TableNameGroup by CheckSum(Col1,Col2,Col3,Col4) having count(1) >1)CheersMIK
Lamprey
Master Smack Fu Yak Hacker
4614 Posts
Posted - 2014-03-14 : 14:06:48
What are the rules to de-dup?If it is just the entire row is the same, then I'd suggest just using the ROW_NUMBER() function to partitoion by each column and DELETE where the "row num" is greater than 1.cake.
Lamprey
Master Smack Fu Yak Hacker
4614 Posts
Posted - 2014-03-14 : 14:16:09
Here is a quick sample:
DECLARE @Foo TABLE (Val1 INT, Val2 INT, Val3 INT)INSERT @Foo VALUES(1, NULL, 1),(1, NULL, 1),(1, NULL, 1),(1, NULL, 1),(1, 1, 1),(1, 2, 2),(1, 2, 2),(1, 2, 3),(1, 2, 3),(1, 2, 2),(2, 2, 2),(2, 2, 2)DELETE TFROM ( SELECT ROW_NUMBER() OVER (PARTITION BY Val1, Val2, Val3 ORDER BY Val1) RowNum FROM @Foo ) AS TWHERE RowNum > 1SELECT * FROM @Foo