Finding Header and Tail Events in SQL Server (Optimization)

I have an event table, I need to find all type 1 tail events and all type 1 head events.

So, for a set of events in this order [1, 1], 3, 1, 4, 5, [1,1,1], the parentheses denote head and tail events of type 1.

This is better illustrated in SQL:

drop table #event
go 
create table #event (group_id int, [date] datetime, [type] int)
create index idx1 on #event (group_id, date)


insert into #event values (1, '2000-01-01', 1) 
insert into #event values (1, '2000-01-02', 1) 
insert into #event values (1, '2000-01-03', 3) 
insert into #event values (1, '2000-01-04', 2) 
insert into #event values (1, '2000-01-05', 1) 
insert into #event values (2, '2000-01-01', 2) 
insert into #event values (2, '2000-01-02', 2) 
insert into #event values (2, '2000-01-03', 3) 
insert into #event values (2, '2000-01-04', 2) 
insert into #event values (2, '2000-01-05', 1) 
insert into #event values (3, '2000-01-01', 1) 
insert into #event values (3, '2000-01-02', 2) 
insert into #event values (3, '2000-01-03', 1) 
insert into #event values (3, '2000-01-04', 2) 
insert into #event values (3, '2000-01-05', 2) 
insert into #event values (4, '2000-01-01', 2) 
insert into #event values (4, '2000-01-02', 2) 
insert into #event values (4, '2000-01-03', 3) 
insert into #event values (4, '2000-01-04', 1) 
insert into #event values (4, '2000-01-05', 1) 

go 

select e1.* from #event e1 
where (
    not exists (
        select top 1 1 
        from #event e2 
        where e1.group_id = e2.group_id 
        and e2.date < e1.date 
        and e2.type <> 1
    ) or not exists (
        select top 1 1 
        from #event e2 
        where e1.group_id = e2.group_id 
        and e2.date > e1.date 
        and e2.type <> 1
    ) 
)
and e1.type = 1 

      

Expected results:

1   2000-01-01 00:00:00.000 1
1   2000-01-02 00:00:00.000 1
1   2000-01-05 00:00:00.000 1
2   2000-01-05 00:00:00.000 1
3   2000-01-01 00:00:00.000 1
4   2000-01-04 00:00:00.000 1
4   2000-01-05 00:00:00.000 1

      

This all works fine and returns my expected results, but it is scanning the table 3 times. Is there a way to do this faster and reduce the number of table scans?

0


source to share


9 replies


To create a large subset of your data, you can use this:

declare @i int 
set @i = 10000
while @i > 5 
begin
    insert into #event values (@i, '2000-01-01', 1) 
    insert into #event values (@i, '2000-01-02', 1) 
    insert into #event values (@i, '2000-01-03', 3) 
    insert into #event values (@i, '2000-01-04', 2) 
    insert into #event values (@i, '2000-01-05', 1)  
    set @i = @i -1 
end 

      

Also, to include many events in a group, try this:



declare @j int 
set @j = 0 
while @j < 10
begin 
    set nocount on 
    declare @i int 
    set @i = 0
    while @i < 10000 
    begin
        insert into #event values (@j, DateAdd(d, @i, '2000-01-01'), rand(10) * 10) 

        set @i = @i  +1 
    end
    set @j = @j + 1 
end
set nocount off

      

In all of my testing, it seems that my original query only produces 3 table scans, and I'm not sure if there is any performance improvement here.

+1


source


I think this is better:



select e1.group_id, e1.date, e1.type
from #event e1, #event e2
where e1.type = 1
and e2.type <> 1
and e1.group_id = e2.group_id
group by e1.group_id, e1.date , e1.type, e2.group_id
having e1.date <min (e2.date) or e1.date> max (e2.date)

+1


source


From what I understand, you are after that - head and tail ordered by day ** for each id **. The head and tail are all records until a record is encountered where the type is not one.

This is another way to do it, it could be faster

;WITH Ranked AS (
    SELECT 
        *,
        Row_Number() OVER (ORDER BY date) as 'rnk'
    FROM #event
)


SELECT * 
FROM Ranked
WHERE rnk not between 
        (SELECT Min(rnk) FROM Ranked r WHERE r.type <> 1 AND ranked.id = r.id)
        AND (SELECT Max(rnk) FROM Ranked r WHERE r.type <> 1 AND ranked.id = r.id)
order by id

      

You don't need to use TOP with the exist operator, although it doesn't hurt.

0


source


Basically it looks like you want to get a type 1 event that is less than the timestamp than the other events and more than the timestamp of a different date Try this, written in Oracle Syntax, not sure about MSSQL.

select e1. * from e1, where e1.id = 1 and (e1.date <= (select min (e2.date) from e2, where e2.id <> 1 group by e2.date) or (e1.date> = select max (e3.date) from e3, where e3.id <> 1 group by e3.date))

0


source


Consider

create idx2 index on #event (type)

I don't have a SQL Server to check, but in Oracle it will eliminate the top level scan (for the "type = 1" condition).

As for the query itself - in MS SQL 2000 "[no] exists" and "no" in "predicates" almost always a full scan was performed - we replaced them with the corresponding JOINs.

0


source


With only 20 records, the query optimizer can easily conclude that 3 table scans are the fastest way to do this, no matter how you express the query or what indexes you create. With small data records, the entire table is likely to boot with one or more disk reads; and there is not much optimization in the read unit; The optimizer is primarily interested in minimizing disk activity, and everything else is relatively unimportant. What the optimizer does with so many records is not a good indicator of how it will handle large volumes.

Do you have a real table with a lot of data? I cannot imagine that you can at all perceive any exectuion time.

0


source


Here is the unions version:

select distinct e1.* from #event e1 
left outer join #event e2 ON 
        e1.id = e2.id 
        and e2.date < e1.date 
        and e2.type <> 1
left outer join #event e3 ON
        e1.id = e3.id 
        and e3.date > e1.date 
        and e3.type <> 1
where e1.type = 1 AND (e2.id is null or e3.id is null)

      

This still has three table scans plus a separate clause, but it still seems to be faster than the original query.

0


source


Here's a query whose showplan at least doesn't say "table scan" in it and gives the same answer.

I still don't understand the query in logical terms.

SELECT DISTINCT e1. * FROM #event e1
WHERE e1.type = 1
  AND
  (
      DOES NOT EXIST (
          SELECT 1 FROM #event
          WHERE type! = 1
              AND id = e1.id
              AND date <e1.date
      )
      OR DOES NOT EXIST (
          SELECT 1 FROM #event
          WHERE type ! = 1
              AND id = e1.id
              AND date> e1.date
      )
  )

0


source


This is better than the latter:

SELECT e1. * FROM event e1
WHERE e1.type = 1
  AND DOES NOT EXIST (
      SELECT 1 FROM event
      WHERE type! = 1
          AND id = e1.id
          AND date <e1.date
  )
UNION ALL SELECT e1. * FROM event e1
WHERE e1.type = 1
  AND DOES NOT EXIST (
      SELECT 1 FROM event
      WHERE type! = 1
          AND id = e1.id
          AND date> e1.date
  )

The result is the same. But your request, and my previous one, are terrible. Here's the IO stats:

(29985 lines (lines) affected)

Event table. Number of scans 9997, logical read 20477, physical read 0, read-read read 0.
Table "Desktop". Scan count 39979, logical read 99950, physical read 0, read forward - 0.

Here's the same for the most recent request:

(29985 lines (lines) affected)

Event table. Scan count 4, logical read 652, physical read 0, read forward - 0.

Two things to notice -
1. Even in the worst case, the entire table is loaded and there are no physical reads.
2. The bad ones have 9997 + 39979 table scans. Not only 4.

Describe the purpose of the request.

0


source







All Articles