An attempt to delete duplicate records in the SQL database deleted all records. Something went wrong?

I tried to delete duplicate entries in my DB. The only difference is PrimaryKey, which is a unique identifier. I have about 1500 entries that were scammed, so in all cases I go through about 3000 entries. So I split about 60 records (based on the date receivedOn) and ran my code to reduce them to 30 and OH CRAP 30 was gone! Here is the code I tried:

DELETE dupes
FROM [emailTable] dupes, [emailTable] fullTable
WHERE (dupes.ReceivedOn > '2009-08-18 23:59:59.999' AND dupes.ReceivedOn < '2009-08-20 00:00:00.000')
      AND (dupes.emlPath = fullTable.emlPath)
      AND NOT (dupes.GUID = fullTable.GUID)

      

My goal is to remove the duplicate. I don't care which one ... but I need ONE of the two entries to stay on the server ... Can anyone shed some light on what I did wrong?

+2


source to share


6 answers


You can do this without a second table. Something like that:

SELECT * FROM emailTable
WHERE EXISTS (
    SELECT * FROM emailTable AS t2
    WHERE t2.emlPath = emailTable.emlPath AND
    t2.GUID > emailTable.GUID)

      

This will show you which entries are about to be deleted. If that's good, change it to:



DELETE FROM emailTable
WHERE EXISTS (
    SELECT * FROM emailTable AS t2
    WHERE t2.emlPath = emailTable.emlPath AND
    t2.GUID > emailTable.GUID)

      

t2.GUID > emailTable.GUID

will make sure that one record with this emlPath

remains in the table.

+7


source


You have to do the delete with a subquery, not a union.

The advantage of this is that you can view the GUIDs you remove before you remove them. (just run the select query yourself)



This will do it, it will remove the smallest GUID

    delete from emailTable where GUID in
    (

     select MIN(dupe.GIUD) from emailTable dupe
       INNER JOIN emailTable noDupe 
         ON dupe.emlPath=noDupe.emlPath 
            where recievedOn between '2009-8-18' and '2009-8-20'
               GROUP BY dupe.emlPath
    )

      

+2


source


What you did wrong is that your query does not exclude any of the duplicates. It selects duplicates that differ from another duplicate in the same way, but each duplicate is different from the other duplicate.

What you need to do is first select the duplicates you want to keep, for example:

select min(GUID)
from emailTable
where ReceivedOn > '...' and ReceivedOn < '...'
group by emlPath
having count(*) > 1

      

Then you remove all duplicates except those.

+1


source


You shouldn't use "=" in your connection. those. "AND NOT (dupes.GUID = fullTable.GUID)" This condition will do nothing, because the GUID of your duplicate rows must be different.

You have to use more than. i.e

delete from emailTable 
WHERE EXISTS
(
    SELECT ID FROM emailTable t2
    WHERE emailTable.GUID > t2.GUID
    AND emailTable.emlPath= t2.emlPath
)

      

0


source


I prefer to use a generic table expression for this and ROW_NUMBER ():

with cte as (
   select row_number() over (partition by emlPath order by GUID) as eml_no
      , ReceivedOn
   from [emailTables])
delete from cte
   where eml_no > 1
   and ReceivedOn between '2009-08-18 23:59:59.999' AND '2009-08-20 00:00:00.000';

      

I prefer this because it gives shuffle control over which the duplicate row is removed. I can delete the third one and keep two, I can choose any order number I want to keep first and it is great at linking.

0


source


This was the code I got thanks to the help of all posts:

DELETE A
  FROM [emailTable] A, [emailTable] B
  WHERE A.MessageID = B.MessageID
        AND A.GUID > B.GUID

      

0


source







All Articles