SQL Removing duplicates one row at a time
I have a table where I save all the row changes that have ever happened. The problem is that there was an error at the start of the application that was creating a bunch of copies of each line.
The table looks something like this:
copies
|ID |CID |DATA
| 1 | 1 | DA
| 2 | 2 | DO
| 2 | 3 | DO (copy of CID 2)
| 1 | 4 | DA (copy of CID 1)
| 2 | 5 | DA
| 1 | 6 | DA (copy of CID 1)
| 2 | 7 | DO
CID UNIQUE in tabular copies.
I want to remove all duplicate DATA GROUP BY IDs that are sorted by CID.
As you can see in the table, CIDs 2 and 3 are the same and they are next to each other. I would like to remove CID 3. Same with CID 4 and CID 6; they have no ID 1 in between and are copies of CID 1.
After removing the duplicates, I would like the table to look like this:
copies
|ID |CID |DATA
| 1 | 1 | DA
| 2 | 2 | DO
| 2 | 5 | DA
| 2 | 7 | DO
Any suggestions?:)
I think my question was heavily asked because the answer that seems to be the best gives this result:
ID | DATA | DATA | DATA | DATA | DATA | DATA | CID |
|Expected | Quassnoi |
1809 | 1 | 0 | 1 | 0 | 0 | NULL | 252227 | 252227 |
1809 | 1 | 0 | 1 | 1 | 0 | NULL | 381530 | 381530 |
1809 | 1 | 0 | 1 | 0 | 0 | NULL | 438158 | (missing) |
1809 | 1 | 0 | 1 | 0 | 1535 | 20090113 | 581418 | 581418 |
1809 | 1 | 1 | 1 | 0 | 1535 | 20090113 | 581421 | 581421 |
CID 252227 AND CID 438158 are duplicates, but since CID 381530 is in between; I want to keep this. These are only duplicates, which are located directly one after another when ordering by CID and ID.
source to share
DELETE c.*
FROM copies c
JOIN (
SELECT id, data, MIN(copies) AS minc
FROM copies
GROUP BY
id, data
) q
ON c.id = q.id
AND c.data = q.data
AND c.cid <> q.minc
Update:
DELETE c.*
FROM (
SELECT cid
FROM (
SELECT cid,
COALESCE(data1 = @data1 AND data2 = @data2, FALSE) AS dup,
@data1 := data1,
@data2 := data2
FROM (
SELECT @data1 := NULL,
@data2 := NULL
) vars, copies ci
ORDER BY
id, cid
) qi
WHERE dup
) q
JOIN copies c
ON c.cid = q.cid
This is empoys solution of session MySQL
variables.
There is a clean solution ANSI
that will use NOT EXISTS
, however it will be slow due to the way the optimizer works MySQL
(it will not use the accessor range
in the correlated subquery).
See this article on my blog for details on performance for a fairly similar task:
source to share
// EDITED for @Jonathan Leffler comment
//$sql = "SELECT ID,CID,DATA FROM copies ORDER BY CID, ID";
$sql = "SELECT ID,CID,DATA FROM copies ORDER BY ID, CID";
$result = mysql_query($sql, $link);
$data = "";
$id = "";
while ($row = mysql_fetch_row($result)){
if (($row[0]!=$id) && ($row[2]!=$data) && ($id!="")){
$sql2 = "DELETE FROM copies WHERE CID=".$row[1];
$res = mysql_query($sql2, $link);
}
$id=$row[0];
$data=$row[2];
}
source to share