Deleting Duplicate Records

By Bill Graziano on 26 March 2001 | Tags: DELETEs

Seema writes "There is a table with no key constraints. It has duplicate records. The duplicate records have to be deleted (e.g. there are 3 similar records, only 2 have to be deleted). I need a single SQL query for this." This is a pretty common question so I thought I'd provide some options.
First, I'll need some duplicates to work with. I use a short script to create a table called dup_authors in the pubs database. It selects a subset of the columns from the authors table and creates some duplicate records.
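The setup script itself isn't reproduced here, but a minimal sketch of the idea, assuming the standard pubs..authors table (the columns and the rows chosen to duplicate are only for illustration), might look like this:

-- Sketch of a setup script (assumes the standard pubs database; the exact
-- rows duplicated by the original script may differ).
use pubs
go

-- copy a subset of the author columns into a table with no keys
select au_lname, au_fname, city, state
into dup_authors
from authors

-- re-insert some rows so the table contains duplicates
insert dup_authors (au_lname, au_fname, city, state)
select au_lname, au_fname, city, state
from authors
where state = 'CA'

-- insert one author twice more so one record appears three times
insert dup_authors (au_lname, au_fname, city, state)
select au_lname, au_fname, city, state
from authors
where au_lname = 'Smith'

insert dup_authors (au_lname, au_fname, city, state)
select au_lname, au_fname, city, state
from authors
where au_lname = 'Smith'
go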
At the end, the script runs a SELECT statement to identify the duplicate records:

select au_lname, au_fname, city, state, count(*)
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

The easiest way I know of to identify duplicates is to do a GROUP BY on all the columns in the table. It can get a little cumbersome if you have a large table. My duplicates look something like this:

au_lname        au_fname   city                 state
--------------- ---------- -------------------- ----- -----------
Smith           Meander    Lawrence             KS              3
Bennet          Abraham    Berkeley             CA              2
Carson          Cheryl     Berkeley             CA              2

except there are thirteen additional duplicates identified.

Second, back up your database. Third, make sure you have a good backup of your database.

Temp Table and Truncate

The simplest way to eliminate the duplicate records is to SELECT DISTINCT into a temporary table, truncate the original table and SELECT the records back into the original table. That query looks like this:

select distinct *
into #holding
from dup_authors

truncate table dup_authors

insert dup_authors
select *
from #holding

drop table #holding

If this is a large table, it can quickly fill up your tempdb. It also isn't very fast: it makes a copy of your data and then makes another copy of your data. And while this script is running, your data is unavailable. It may not be the best solution, but it certainly works.

Rename and Copy Back

The second option is to rename the original table to something else, and copy the unique records into the original table. That looks like this:

sp_rename 'dup_authors', 'temp_dup_authors'

select distinct *
into dup_authors
from temp_dup_authors

drop table temp_dup_authors

This has a couple of benefits over the first option. It doesn't use tempdb and it only makes one copy of the data. On the downside, you'll need to rebuild any indexes or constraints on the table when you're done. This one also makes the data unavailable during the process.

Create a Primary Key

Our last option is more complex. It has the benefit of not making a copy of the data and only deleting the records that are duplicates. Its main drawback is that we have to alter the original table and add a sequential record number column to uniquely identify each record. That script looks like this:

-- Add a new column
-- In real life I'd put an index on it
alter table dup_authors add NewPK int NULL
go

-- Populate the new primary key
declare @intCounter int
set @intCounter = 0
update dup_authors
set @intCounter = NewPK = @intCounter + 1

-- ID the records to delete and get one primary key value also
-- We'll delete all but this primary key
select au_lname, au_fname, city, state,
    RecCount = count(*),
    PKtoKeep = max(NewPK)
into #dupes
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

-- Delete dupes except one primary key for each duplicate record
delete dup_authors
from dup_authors a
join #dupes d
    on d.au_lname = a.au_lname
    and d.au_fname = a.au_fname
    and d.city = a.city
    and d.state = a.state
where a.NewPK not in (select PKtoKeep from #dupes)

-- Remove the NewPK column
alter table dup_authors drop column NewPK
go

drop table #dupes

It's actually possible to combine the SELECT INTO #dupes and the DELETE into one DELETE statement (a sketch of that combined form follows below). My script is easier to read and understand, and it shouldn't be much slower. It runs a single DELETE statement against your table and only removes the duplicate records. If you have a large table, the join can get kind of large. I guess that's the price you pay for letting duplicates into your database.
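For completeness, here is a rough sketch of what that combined form might look like (not the script from the article, just one way to fold the duplicate-finding query into the DELETE as a derived table, assuming NewPK has already been added and populated as above):

-- Sketch: combine the duplicate-finding query and the DELETE into one
-- statement by joining to a derived table instead of #dupes.
-- Assumes dup_authors already has the populated NewPK column from above.
delete a
from dup_authors a
join (
    select au_lname, au_fname, city, state, PKtoKeep = max(NewPK)
    from dup_authors
    group by au_lname, au_fname, city, state
    having count(*) > 1
) d
    on d.au_lname = a.au_lname
    and d.au_fname = a.au_fname
    and d.city = a.city
    and d.state = a.state
where a.NewPK <> d.PKtoKeep

The derived table plays the same role as #dupes; the main difference is that nothing gets materialized in tempdb.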
We have a whole section on Database Design. If you find yourself deleting duplicates, I'd take a look at some of those articles. I'd also think about creating some primary keys that enforce uniqueness.
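As a rough sketch of that last suggestion (the constraint names and the choice of columns are only illustrative; pick whatever genuinely identifies a row in your table):

-- Sketch: keep duplicates out in the first place.
-- Run this after the duplicates are gone, or the unique constraint will fail.

-- add a surrogate primary key
alter table dup_authors add au_id int identity(1,1) not null
    constraint PK_dup_authors primary key

-- and a unique constraint on the columns that should identify a row,
-- so a duplicate INSERT is rejected instead of silently accepted
alter table dup_authors add constraint UQ_dup_authors
    unique (au_lname, au_fname, city, state)

With constraints like these in place, an INSERT that would create a duplicate simply fails, which is a much cheaper problem to deal with than cleaning up afterwards.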