Finding Header and Tail Events in SQL Server (Optimization)
I have an event table, I need to find all type 1 tail events and all type 1 head events.
So, for a set of events in this order [1, 1], 3, 1, 4, 5, [1,1,1], the parentheses denote head and tail events of type 1.
This is better illustrated in SQL:
drop table #event
go
create table #event (group_id int, [date] datetime, [type] int)
create index idx1 on #event (group_id, date)
insert into #event values (1, '2000-01-01', 1)
insert into #event values (1, '2000-01-02', 1)
insert into #event values (1, '2000-01-03', 3)
insert into #event values (1, '2000-01-04', 2)
insert into #event values (1, '2000-01-05', 1)
insert into #event values (2, '2000-01-01', 2)
insert into #event values (2, '2000-01-02', 2)
insert into #event values (2, '2000-01-03', 3)
insert into #event values (2, '2000-01-04', 2)
insert into #event values (2, '2000-01-05', 1)
insert into #event values (3, '2000-01-01', 1)
insert into #event values (3, '2000-01-02', 2)
insert into #event values (3, '2000-01-03', 1)
insert into #event values (3, '2000-01-04', 2)
insert into #event values (3, '2000-01-05', 2)
insert into #event values (4, '2000-01-01', 2)
insert into #event values (4, '2000-01-02', 2)
insert into #event values (4, '2000-01-03', 3)
insert into #event values (4, '2000-01-04', 1)
insert into #event values (4, '2000-01-05', 1)
go
select e1.* from #event e1
where (
not exists (
select top 1 1
from #event e2
where e1.group_id = e2.group_id
and e2.date < e1.date
and e2.type <> 1
) or not exists (
select top 1 1
from #event e2
where e1.group_id = e2.group_id
and e2.date > e1.date
and e2.type <> 1
)
)
and e1.type = 1
Expected results:
1 2000-01-01 00:00:00.000 1
1 2000-01-02 00:00:00.000 1
1 2000-01-05 00:00:00.000 1
2 2000-01-05 00:00:00.000 1
3 2000-01-01 00:00:00.000 1
4 2000-01-04 00:00:00.000 1
4 2000-01-05 00:00:00.000 1
This all works fine and returns my expected results, but it is scanning the table 3 times. Is there a way to do this faster and reduce the number of table scans?
source to share
To create a large subset of your data, you can use this:
declare @i int
set @i = 10000
while @i > 5
begin
insert into #event values (@i, '2000-01-01', 1)
insert into #event values (@i, '2000-01-02', 1)
insert into #event values (@i, '2000-01-03', 3)
insert into #event values (@i, '2000-01-04', 2)
insert into #event values (@i, '2000-01-05', 1)
set @i = @i -1
end
Also, to include many events in a group, try this:
declare @j int
set @j = 0
while @j < 10
begin
set nocount on
declare @i int
set @i = 0
while @i < 10000
begin
insert into #event values (@j, DateAdd(d, @i, '2000-01-01'), rand(10) * 10)
set @i = @i +1
end
set @j = @j + 1
end
set nocount off
In all of my testing, it seems that my original query only produces 3 table scans, and I'm not sure if there is any performance improvement here.
source to share
From what I understand, you are after that - head and tail ordered by day ** for each id **. The head and tail are all records until a record is encountered where the type is not one.
This is another way to do it, it could be faster
;WITH Ranked AS (
SELECT
*,
Row_Number() OVER (ORDER BY date) as 'rnk'
FROM #event
)
SELECT *
FROM Ranked
WHERE rnk not between
(SELECT Min(rnk) FROM Ranked r WHERE r.type <> 1 AND ranked.id = r.id)
AND (SELECT Max(rnk) FROM Ranked r WHERE r.type <> 1 AND ranked.id = r.id)
order by id
You don't need to use TOP with the exist operator, although it doesn't hurt.
source to share
Basically it looks like you want to get a type 1 event that is less than the timestamp than the other events and more than the timestamp of a different date Try this, written in Oracle Syntax, not sure about MSSQL.
select e1. * from e1, where e1.id = 1 and (e1.date <= (select min (e2.date) from e2, where e2.id <> 1 group by e2.date) or (e1.date> = select max (e3.date) from e3, where e3.id <> 1 group by e3.date))
source to share
Consider
create idx2 index on #event (type)
I don't have a SQL Server to check, but in Oracle it will eliminate the top level scan (for the "type = 1" condition).
As for the query itself - in MS SQL 2000 "[no] exists" and "no" in "predicates" almost always a full scan was performed - we replaced them with the corresponding JOINs.
source to share
With only 20 records, the query optimizer can easily conclude that 3 table scans are the fastest way to do this, no matter how you express the query or what indexes you create. With small data records, the entire table is likely to boot with one or more disk reads; and there is not much optimization in the read unit; The optimizer is primarily interested in minimizing disk activity, and everything else is relatively unimportant. What the optimizer does with so many records is not a good indicator of how it will handle large volumes.
Do you have a real table with a lot of data? I cannot imagine that you can at all perceive any exectuion time.
source to share
Here is the unions version:
select distinct e1.* from #event e1
left outer join #event e2 ON
e1.id = e2.id
and e2.date < e1.date
and e2.type <> 1
left outer join #event e3 ON
e1.id = e3.id
and e3.date > e1.date
and e3.type <> 1
where e1.type = 1 AND (e2.id is null or e3.id is null)
This still has three table scans plus a separate clause, but it still seems to be faster than the original query.
source to share
Here's a query whose showplan at least doesn't say "table scan" in it and gives the same answer.
I still don't understand the query in logical terms.
SELECT DISTINCT e1. * FROM #event e1
WHERE e1.type = 1
AND
(
DOES NOT EXIST (
SELECT 1 FROM #event
WHERE type! = 1
AND id = e1.id
AND date <e1.date
)
OR DOES NOT EXIST (
SELECT 1 FROM #event
WHERE type ! = 1
AND id = e1.id
AND date> e1.date
)
)
source to share
This is better than the latter:
SELECT e1. * FROM event e1
WHERE e1.type = 1
AND DOES NOT EXIST (
SELECT 1 FROM event
WHERE type! = 1
AND id = e1.id
AND date <e1.date
)
UNION ALL SELECT e1. * FROM event e1
WHERE e1.type = 1
AND DOES NOT EXIST (
SELECT 1 FROM event
WHERE type! = 1
AND id = e1.id
AND date> e1.date
)
The result is the same. But your request, and my previous one, are terrible. Here's the IO stats:
(29985 lines (lines) affected)
Event table. Number of scans 9997, logical read 20477, physical read 0, read-read read 0.
Table "Desktop". Scan count 39979, logical read 99950, physical read 0, read forward - 0.
Here's the same for the most recent request:
(29985 lines (lines) affected)
Event table. Scan count 4, logical read 652, physical read 0, read forward - 0.
Two things to notice -
1. Even in the worst case, the entire table is loaded and there are no physical reads.
2. The bad ones have 9997 + 39979 table scans. Not only 4.
Describe the purpose of the request.
source to share