Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Remove duplicates Query

Author  Topic 

Mattneedshelp
Starting Member

3 Posts

Posted - 2012-05-14 : 09:41:05
Hi,

Here is my problem:

We have a job at work that takes data from a staging table, places it all into a text file, and then transfers it to another system. If there are duplicate rows in the text file, the data transfer fails because duplicate data can't be placed into the system.

But here is the problem, its not all of the row that's duplicated. So for example you have the tracking number, the action code, and the delivery time, but the action code can be whatever it likes, so if -

Tracking no. Action code delivery time
12345 C 2308
12345 B 2308

If that occured, the job would fail. The action code is effectively ignored, but you can't have 2 instances of the same delivery time and tracking number.

We currently run some SQL whereby it does a count of everything in the database, and brings back all of the duplicated rows. We then take the tracking numbers, and one by one have to do a 'find' in the .txt file, and remove each duplicate row. Bearing in mind most days there are about 80 duplicates, this whole process can take an hour.

I'm thinking there MUST be a way of saying 'show me a distinct count of the tracking number + the delivery time in column a, and in column b show me the total number of rows for each tracking number + delivery time. Where there is more than 1 row, delete the appropriate number of rows to make it unique'.

Is this far too complicated? Im hoping that if something like this was available, it could be done as an automated sql process in our batch schedule in the middle of the process, so there'd be a duplicate check before the data transfer is even attempted.

I hope this makes sense, and I hope someone can help.

Thank you!

madhivanan
Premature Yak Congratulator

22864 Posts

Posted - 2012-05-14 : 09:53:33

This will get you duplicates

select [Tracking no],[delivery time] from table
group by [Tracking no],[delivery time]
having count(*)>1

Madhivanan

Failing to plan is Planning to fail
Go to Top of Page

Mattneedshelp
Starting Member

3 Posts

Posted - 2012-05-14 : 10:23:47
Yeah, I have no problem finding the duplicates, it's the actual removal of them.

Say for example I ran the code below -

SELECT COUNT(*) AS Total, TrackingNumber + DeliveryTime AS Details FROM Trackingtable
GROUP BY TrackingNumber + DeliveryTime
ORDER BY COUNT(*) DESC

What code could I include after that would say where 'Total' > 1, delete the duplicate rows to get the count back to 1?

Go to Top of Page

madhivanan
Premature Yak Congratulator

22864 Posts

Posted - 2012-05-14 : 10:29:35
Take a data backup for that table before running this

delete t1 from table as t1 inner join
(
select [Tracking no],[delivery time] from table
group by [Tracking no],[delivery time]
having count(*)>1
) as t2
on t1.[Tracking no]=t2.[Tracking no] and t1.[delivery time] =t2.[delivery time]

Madhivanan

Failing to plan is Planning to fail
Go to Top of Page

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-05-14 : 10:30:50
quote:
Originally posted by Mattneedshelp

Yeah, I have no problem finding the duplicates, it's the actual removal of them.

Say for example I ran the code below -

SELECT COUNT(*) AS Total, TrackingNumber + DeliveryTime AS Details FROM Trackingtable
GROUP BY TrackingNumber + DeliveryTime
ORDER BY COUNT(*) DESC

What code could I include after that would say where 'Total' > 1, delete the duplicate rows to get the count back to 1?



You can do it a few different ways. My favorite is using a CTE - see below. First run the code as it is; that will show you the rows that are going to be deleted. Then, uncomment the last line and comment the line above that and run to do the actual delete. As always, please test in a test environment before you pull the trigger:
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [Tracking no],[delivery time] ORDER BY Actioncode) AS RN
FROM
YourTable
)
SELECT * FROM cte WHERE RN > 1;
--DELETE FROM cte WHERE RN > 1;
Go to Top of Page

Mattneedshelp
Starting Member

3 Posts

Posted - 2012-05-14 : 10:33:02
Thanks Sunitabeck, are there alternative methods should that one not work? What is a CTE, i've never come across that before.

Also, with the delete line, if the row is 2, and I run the delete line, is it not deleting both rows, or is it only deleting the one?
Go to Top of Page

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-05-14 : 11:21:09
quote:
Originally posted by Mattneedshelp

Thanks Sunitabeck, are there alternative methods should that one not work? What is a CTE, i've never come across that before.

Also, with the delete line, if the row is 2, and I run the delete line, is it not deleting both rows, or is it only deleting the one?

There are some examples of other methods here (in the dowloadable zip file): http://www.sqlmag.com/article/tsql3/deleting-duplicate-records

CTE is acronym for Common Table Expression. It is sort of like a sub query in this context; but CTE's are much more than just a subquery written on top. http://msdn.microsoft.com/en-US/library/ms190766(v=SQL.90).aspx

To answer your question - yes, it will delete only one row if there are two rows as in your example. Run the select query and you will see which rows are going to be delete. Remove the where clause and you will see all the rows.
Go to Top of Page

madhivanan
Premature Yak Congratulator

22864 Posts

Posted - 2012-05-16 : 04:25:30
Have you tried my method too?

Madhivanan

Failing to plan is Planning to fail
Go to Top of Page
   

- Advertisement -