Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 remove dups, keep the one with most data

Author  Topic 

WJHamel
Aged Yak Warrior

651 Posts

Posted - 2012-08-28 : 10:46:23
So i have a table with duplicate entries. I want to keep only ONE of each entry with a common ECSOID value, but the ONE row i want to keep is the ONE with the most data in all of it's columns. Typically, i use something like this JUST to get rid of duplicates, but it contains no where clause keeping the row that i want:

this query is from an unrelated table:

Delete d FROM
(
SELECT personid, firstname, lastname, middlename, ROW_NUMBER()OVER(ORDER BY personid) AS RowNum
FROM persondata)D
JOIN
(
SELECT personid, firstname, lastname, middlename, ROW_NUMBER()OVER(ORDER BY personid) AS RowNum
FROM persondata)E
on D.personid = E.personid
AND D.RowNum < E.RowNum
delete FROM persondata where Firstname='' AND lastname='' AND dob IS NULL



here's a snapshot from the table i want to deal with. it has many other columns than those you see here:


ECSOID DOB Race SexID Hair eyes
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 NULL
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 GRY BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147038 1972-10-29 00:00:00.000 W 14 BRO BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1941-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI147279 1940-07-07 00:00:00.000 B 14 BLK BRO
DCSO00MNI151477 1956-12-21 00:00:00.000 W 14 BRO GRN
DCSO00MNI151477 1956-12-21 00:00:00.000 W 14 BRO
DCSO00MNI151477 1956-12-21 00:00:00.000 W 14 BRO GRN



Is there a modification i can apply to the query above which will leave intact what i want?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2012-08-28 : 10:55:19
quote:
it has many other columns than those you see here
Well, how many more? Are certain columns more important than others? E.g., if there are 2 rows that have 46 out of 50 columns filled in, but they are 46 different columns, which row should be kept?
Go to Top of Page

WJHamel
Aged Yak Warrior

651 Posts

Posted - 2012-08-28 : 11:06:44
i can identify about 20 columns amongst the 60 or so that are available that i need to include in that level of priority.
Go to Top of Page

WJHamel
Aged Yak Warrior

651 Posts

Posted - 2012-08-28 : 11:32:53
Further reflection leads me to think i need to add a where clause stating
Where d.DOB is not null or d.dbo <>''
and d.race is not null or d.race<>''

and so on and on and on for all the columns i want examined.

Yes? No?
Go to Top of Page

robvolk
Most Valuable Yak

15732 Posts

Posted - 2012-08-28 : 12:03:03
That's one way of doing it, and possibly the most efficient, but as usual it's impossible to say without the full schema, sample data and expected results. It would help if you could research a bit and find some edge cases where 2 or more rows are very close to one another and need to be weighted somehow. For instance, in your original sample:
ECSOID	          DOB	               Race	SexID	Hair	eyes
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 GRY BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
DCSO00MNI144747 1939-11-15 00:00:00.000 W 15 BLN BRO
All the rows are identical except the hair color on that one red row. Should it be discarded? I'd think a person born in 1939 is more likely to have gray hair than blonde hair; however that's guesswork, and a standard data analysis would discard it as an outlier.
Go to Top of Page

WJHamel
Aged Yak Warrior

651 Posts

Posted - 2012-08-29 : 11:57:24
Ultimately, the task at hand was deemed unnecessary. Thanks for the input nonetheless.
Go to Top of Page
   

- Advertisement -