Remove duplicates - only everything except the very last line
Possible duplicate:
How to find duplicate records and remove the oldest ones from SQL?
I have a database with several thousand duplicates due to a faulty update tool. I can identify collections of items with duplicates, but you only need to delete the oldest records, not the lowest identifier. The test data looks like this: the correct line has *
Articles with duplicate titles that do not have duplicate rules should be removed except for the most recently created lines. (the actual id column is the GUID, so I can't assume auto-incrementing)
Id Article id Rule Id Title Opened Date
-- ---------- ------- ----- -----------
1* 111 5 T1 2013-01-20
2 112 5 T1 2013-07-01
3* 113 6 T2 2013-07-01
4* 114 7 T2 2013-07-02
5 115 8 T3 2012-07-01
6 116 8 T3 2013-01-20
7* 117 8 T3 2013-01-21
Table schema:
CREATE TABLE [dbo].[test_ai](
[id] [int] NOT NULL,
[ArticleId] [varchar](50) NOT NULL,
[ruleid] [varchar](50) NULL,
[Title] [nvarchar](max) NULL,
[AuditData_WhenCreated] [datetime] NULL,
PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
Test data inserts
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (1, 111, 5, 'test 1', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (2, 112, 5, 'test 1', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (3, 113, 6, 'test 2', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (4, 114, 7, 'test 2', '2012-07-02')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (5, 115, 8, 'test 3', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (6, 116, 8, 'test 3', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (7, 117, 8, 'test 3', '2013-01-21')
My current request looks like
select * from test_ai
where test_ai.id in
-- set 1 - all rows with duplicates
(select f.id
from test_ai as F
WHERE exists (select ruleid, title, count(id)
FROM test_ai
WHERE test_ai.title = F.title
AND test_ai.ruleid = F.ruleid
GROUP BY test_ai.title, test_ai.ruleid
having count(test_ai.id) > 1))
and test_ai.id not in
-- set 2 - includes one row from each set of duplicates
(select min(id)
from test_ai as F
WHERE EXISTS (select ruleid, title, count(id)
from test_ai
WHERE test_ai.title = F.title
AND test_ai.ruleid = F.ruleid
group by test_ai.title, test_ai.ruleid
HAVING count(test_ai.id) > 1)
GROUP BY title, ruleid
)
This SQL identifies some of the rows to be deleted (lines 2,6,7), but it selects the oldest entry by "open date". (should delete lines 2,5,6). I realize I am not pointing this to the statement, but am afraid how to add this last snippet. If this results in a script that I need to run more than once to remove duplicates when there is more than one duplicate, that's not a problem.
The real problem is much more complicated, but if I can get past this blocking part, I can move forward again. Thanks for watching!
source to share
A typical model for removing one row from a set (or from every group in a set) in SQL Server 2005+:
;WITH cte AS
(
SELECT col, rn = ROW_NUMBER() OVER
(PARTITION BY something ORDER BY something)
FROM dbo.base_table
WHERE ...
)
DELETE x WHERE rn = 1;
In your case, it would be:
;WITH cte AS
(
SELECT id, ruleid, Title, rn = ROW_NUMBER() OVER
(
PARTITION BY ruleid, Title
ORDER BY auditdata_whencreated DESC
)
FROM dbo.test_ai
)
DELETE cte
OUTPUT deleted.id
WHERE rn > 1;
Results:
id
----
2
6
5
source to share