An attempt to delete duplicate records in the SQL database deleted all records. Something went wrong?
I tried to delete duplicate entries in my DB. The only difference is PrimaryKey, which is a unique identifier. I have about 1500 entries that were scammed, so in all cases I go through about 3000 entries. So I split about 60 records (based on the date receivedOn) and ran my code to reduce them to 30 and OH CRAP 30 was gone! Here is the code I tried:
DELETE dupes
FROM [emailTable] dupes, [emailTable] fullTable
WHERE (dupes.ReceivedOn > '2009-08-18 23:59:59.999' AND dupes.ReceivedOn < '2009-08-20 00:00:00.000')
AND (dupes.emlPath = fullTable.emlPath)
AND NOT (dupes.GUID = fullTable.GUID)
My goal is to remove the duplicate. I don't care which one ... but I need ONE of the two entries to stay on the server ... Can anyone shed some light on what I did wrong?
source to share
You can do this without a second table. Something like that:
SELECT * FROM emailTable
WHERE EXISTS (
SELECT * FROM emailTable AS t2
WHERE t2.emlPath = emailTable.emlPath AND
t2.GUID > emailTable.GUID)
This will show you which entries are about to be deleted. If that's good, change it to:
DELETE FROM emailTable
WHERE EXISTS (
SELECT * FROM emailTable AS t2
WHERE t2.emlPath = emailTable.emlPath AND
t2.GUID > emailTable.GUID)
t2.GUID > emailTable.GUID
will make sure that one record with this emlPath
remains in the table.
source to share
You have to do the delete with a subquery, not a union.
The advantage of this is that you can view the GUIDs you remove before you remove them. (just run the select query yourself)
This will do it, it will remove the smallest GUID
delete from emailTable where GUID in
(
select MIN(dupe.GIUD) from emailTable dupe
INNER JOIN emailTable noDupe
ON dupe.emlPath=noDupe.emlPath
where recievedOn between '2009-8-18' and '2009-8-20'
GROUP BY dupe.emlPath
)
source to share
What you did wrong is that your query does not exclude any of the duplicates. It selects duplicates that differ from another duplicate in the same way, but each duplicate is different from the other duplicate.
What you need to do is first select the duplicates you want to keep, for example:
select min(GUID)
from emailTable
where ReceivedOn > '...' and ReceivedOn < '...'
group by emlPath
having count(*) > 1
Then you remove all duplicates except those.
source to share
You shouldn't use "=" in your connection. those. "AND NOT (dupes.GUID = fullTable.GUID)" This condition will do nothing, because the GUID of your duplicate rows must be different.
You have to use more than. i.e
delete from emailTable
WHERE EXISTS
(
SELECT ID FROM emailTable t2
WHERE emailTable.GUID > t2.GUID
AND emailTable.emlPath= t2.emlPath
)
source to share
I prefer to use a generic table expression for this and ROW_NUMBER ():
with cte as (
select row_number() over (partition by emlPath order by GUID) as eml_no
, ReceivedOn
from [emailTables])
delete from cte
where eml_no > 1
and ReceivedOn between '2009-08-18 23:59:59.999' AND '2009-08-20 00:00:00.000';
I prefer this because it gives shuffle control over which the duplicate row is removed. I can delete the third one and keep two, I can choose any order number I want to keep first and it is great at linking.
source to share