Aggregate for each day by time, without using logic without equivalence

Initial question

Given the following dataset paired with a date table:

MembershipId | ValidFromDate | ValidToDate
==========================================
0001         | 1997-01-01    | 2006-05-09
0002         | 1997-01-01    | 2017-05-12
0003         | 2005-06-02    | 2009-02-07

      

How many Memberships

were open on any given day or over several days?

Original Answer

Following this question asked here , this answer provided the required functionality:

select d.[Date]
      ,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
    left join Memberships as m
        on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];

      

although the commenter noticed that there are other approaches where not equijoin takes too long.

Followup

As such, that the equijoin logic only resembles replicating the query output above?


Progress so far

From the answers provided so far, I've provided the one below that outperforms the hardware I'm working with across 3.2 million entries Membership

:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
)
,e as
(
    select d.[Date] as d
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
),c as
(
    select isnull(s.d,e.d) as d
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
)
select d.[Date]
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;

      

Following this, in order to split this aggregate into composite groups per day, I have the following, which also works well:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,s.MembershipGrouping as g
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
            ,s.MembershipGrouping
)
,e as
(
    select d.[Date] as d
        ,e..MembershipGrouping as g
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
            ,e.MembershipGrouping
),c as
(
    select isnull(s.d,e.d) as d
            ,isnull(s.g,e.g) as g
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
                and s.g = e.g
)
select d.[Date]
    ,c.g
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
        ,c.g
;

      

Can anyone improve this above?

+17


source to share


6 answers


Assuming your date dimension contains all dates contained in all membership periods, you can use something like the following.

A join is an equi join, so it is possible to use a hash join or join a join of more than just nested loops (which will execute the inner subtree for every outer row).

Assuming the index is on (ValidToDate) include(ValidFromDate)

or reverse, you can use single search against Memberships

and single lookup of the date dimension. The below given time is less than mine to return year results against a table with 3.2 million members and 1.4 million total active membership ( script )



DECLARE @StartDate DATE = '2016-01-01',
        @EndDate   DATE = '2016-12-31';

WITH MD
     AS (SELECT Date,
                SUM(Adj) AS MemberDelta
         FROM   Memberships
                CROSS APPLY (VALUES ( ValidFromDate, +1),
                                    --Membership count decremented day after the ValidToDate
                                    (DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
         WHERE
          --Members already expired before the time range of interest can be ignored
          ValidToDate >= @StartDate
          AND
          --Members whose membership starts after the time range of interest can be ignored
          ValidFromDate <= @EndDate
         GROUP  BY Date),
     MC
     AS (SELECT DD.DateKey,
                SUM(MemberDelta) OVER (ORDER BY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
         FROM   DIM_DATE DD
                LEFT JOIN MD
                  ON MD.Date = DD.DateKey)
SELECT DateKey,
       CountOfNonIgnoredMembers AS MembershipCount
FROM   MC
WHERE  DateKey BETWEEN @StartDate AND @EndDate 
ORDER BY DateKey

      

Demo (using extended period as calendar year 2016 is not very interesting with sample data)

enter image description here

+6


source


If most of the validity intervals for your membership are longer than a few days, take a look at Martin Smith's answer. This approach will probably be faster.


When you grab the calendar table ( DIM.[Date]

) and join it to the left Memberships

, you can complete the table scan Memberships

for each date in the range. Even if (ValidFromDate, ValidToDate)

there is no index, it might not be very useful.

It is easy to rotate it. Scan the table Memberships

just once and for each membership find the dates that are valid using CROSS APPLY

.

Sample data

DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo   date = '2006-12-31';

      

Request 1

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= Memberships.ValidFromDate
            AND dbo.Calendar.dt <= Memberships.ValidToDate
            AND dbo.Calendar.dt >= @RangeFrom
            AND dbo.Calendar.dt <= @RangeTo
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE);

      

OPTION(RECOMPILE)

not really required, I include it in all queries when I compare execution plans to make sure I get the latest plan when I play with the queries.

When I looked at the plan for this query, I saw that the table lookup Calendar.dt

was only used ValidFromDate

and ValidToDate

, @RangeFrom

and tags @RangeTo

were moved to the remainder predicate. It's not perfect. The optimizer is not smart enough to calculate the maximum of two dates ( ValidFromDate

and @RangeFrom

) and use that date as the starting point of the search.

search 1

It is easy to help the optimizer:

Request 2



SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

      

In this query, the search is optimal and does not read dates that might be discarded later.

search 2

Finally, you may not need to scan the entire table Memberships

. We only need rows where the given date range overlaps with the valid membership range.

Request 3

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) AS CA
WHERE
    Memberships.ValidToDate >= @RangeFrom
    AND Memberships.ValidFromDate <= @RangeTo
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

      

Two intervals [a1;a2]

and [b1;b2]

intersect when

a2 >= b1 and a1 <= b2

      

These queries assume that the table Calendar

has an index on dt

.

You should try and see which indexes are best for the table Memberships

. For the last query, if the table is fairly large, it is likely that two separate indexes on ValidFromDate

and on ValidToDate

are better than one index on (ValidFromDate, ValidToDate)

.

You should try different queries and measure their performance on real hardware with real data. Performance can depend on the distribution of the data, the number of memberships in it, what their actual dates are, how wide or narrow the range is, etc.

I recommend using an excellent tool called SQL Sentry Plan Explorer to analyze and compare execution plans. It's free. It shows a lot of useful statistics like execution time and number of reads for each request. The screenshots above are from this tool.

+12


source


One approach is to first use INNER JOIN to find a set of matches and COUNT () for a project MemberCount GROUPed BY DateKey, then UNION ALL with the same set of dates, from 0 on that projection to count members for each date. The last step is SUM () the MemberCount of this union and the GROUP BY DateKey. As requested, this avoids LEFT PASS and DOES NOT EXIST. As another member pointed out, this is not an equi because we need to use a range, but I think it does what you intend.

This will serve 1 year of data with logical reads of about 100 kilobytes. On a typical laptop with a spinning cold cache drive, it serves 1 month per second (with correct values).

Here's an example that creates 3.3 million lines of random length. The query below returns data for one month.

--Stay quiet for a moment
SET NOCOUNT ON
SET STATISTICS IO OFF
SET STATISTICS TIME OFF

--Clean up if re-running
DROP TABLE IF EXISTS DIM_DATE
DROP TABLE IF EXISTS FACT_MEMBER

--Date dimension
CREATE TABLE DIM_DATE
  (
  DateKey DATE NOT NULL 
  )

--Membership fact
CREATE TABLE FACT_MEMBER
  (
  MembershipId INT NOT NULL
  , ValidFromDateKey DATE NOT NULL
  , ValidToDateKey DATE NOT NULL
  )

--Populate Date dimension from 2001 through end of 2018
DECLARE @startDate DATE = '2001-01-01'
DECLARE @endDate DATE = '2018-12-31'
;WITH CTE_DATE AS
(
SELECT @startDate AS DateKey
UNION ALL
SELECT
       DATEADD(DAY, 1, DateKey)
FROM
       CTE_DATE AS D
WHERE
       D.DateKey < @endDate
)
INSERT INTO
  DIM_DATE
  (
  DateKey
  )
SELECT
  D.DateKey
FROM
  CTE_DATE AS D
OPTION (MAXRECURSION 32767)

--Populate Membership fact with members having a random membership length from 1 to 36 months 
;WITH CTE_DATE AS
(
SELECT @startDate AS DateKey
UNION ALL
SELECT
       DATEADD(DAY, 1, DateKey)
FROM
       CTE_DATE AS D
WHERE
       D.DateKey < @endDate
)
,CTE_MEMBER AS
(
SELECT 1 AS MembershipId
UNION ALL
SELECT MembershipId + 1 FROM CTE_MEMBER WHERE MembershipId < 500
)
,
CTE_MEMBERSHIP
AS
(
SELECT
  ROW_NUMBER() OVER (ORDER BY NEWID()) AS MembershipId
  , D.DateKey AS ValidFromDateKey
FROM
  CTE_DATE AS D
  CROSS JOIN CTE_MEMBER AS M
)
INSERT INTO
    FACT_MEMBER
    (
    MembershipId
    , ValidFromDateKey
    , ValidToDateKey
    )
SELECT
    M.MembershipId
    , M.ValidFromDateKey
      , DATEADD(MONTH, FLOOR(RAND(CHECKSUM(NEWID())) * (36-1)+1), M.ValidFromDateKey) AS ValidToDateKey
FROM
    CTE_MEMBERSHIP AS M
OPTION (MAXRECURSION 32767)

--Add clustered Primary Key to Date dimension
ALTER TABLE DIM_DATE ADD CONSTRAINT PK_DATE PRIMARY KEY CLUSTERED
    (
    DateKey ASC
    )

--Index
--(Optimize in your spare time)
DROP INDEX IF EXISTS SK_FACT_MEMBER ON FACT_MEMBER
CREATE CLUSTERED INDEX SK_FACT_MEMBER ON FACT_MEMBER
    (
    ValidFromDateKey ASC
    , ValidToDateKey ASC
    , MembershipId ASC
    )


RETURN

--Start test
--Emit stats
SET STATISTICS IO ON
SET STATISTICS TIME ON

--Establish range of dates
DECLARE
  @rangeStartDate DATE = '2010-01-01'
  , @rangeEndDate DATE = '2010-01-31'

--UNION the count of members for a specific date range with the "zero" set for the same range, and SUM() the counts
;WITH CTE_MEMBER
AS
(
SELECT
    D.DateKey
    , COUNT(*) AS MembershipCount
FROM
    DIM_DATE AS D
    INNER JOIN FACT_MEMBER AS M ON
        M.ValidFromDateKey <= @rangeEndDate
        AND M.ValidToDateKey >= @rangeStartDate
        AND D.DateKey BETWEEN M.ValidFromDateKey AND M.ValidToDateKey
WHERE
    D.DateKey BETWEEN @rangeStartDate AND @rangeEndDate
GROUP BY
    D.DateKey

UNION ALL

SELECT
    D.DateKey
    , 0 AS MembershipCount
FROM
    DIM_DATE AS D
WHERE
    D.DateKey BETWEEN @rangeStartDate AND @rangeEndDate
)
SELECT
    M.DateKey
    , SUM(M.MembershipCount) AS MembershipCount
FROM
    CTE_MEMBER AS M
GROUP BY
    M.DateKey
ORDER BY
    M.DateKey ASC
OPTION (RECOMPILE, MAXDOP 1)

      

+2


source


This is how I would solve this problem with equijoin:

--data generation
declare @Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into @Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')

declare @startDate date, @endDate date
select @startDate =  MIN(ValidFromDate), @endDate = max(ValidToDate) from @Membership
--in order to use equijoin I need all days between min date and max date from Membership table (both columns)
;with cte as (
    select @startDate [date]
    union all
    select DATEADD(day, 1, [date]) from cte
    where [date] < @endDate
)
--in this query, we will assign value to each day:
--one, if project started on that day
--minus one, if project ended on that day
--then, it enough to (cumulative) sum all this values to get how many projects were ongoing on particular day
select [date],
       sum(case when [DATE] = ValidFromDate then 1 else 0 end +
            case when [DATE] = ValidToDate then -1 else 0 end)
            over (order by [date] rows between unbounded preceding and current row)
from cte [c]
left join @Membership [m]
on [c].[date] = [m].ValidFromDate  or [c].[date] = [m].ValidToDate
option (maxrecursion 0)

      

Here's another solution:

--data generation
declare @Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into @Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')

;with cte as (
    select CAST('2016-01-01' as date) [date]
    union all
    select DATEADD(day, 1, [date]) from cte
    where [date] < '2016-12-31'
)

select [date],
       (select COUNT(*) from @Membership where ValidFromDate < [date]) - 
       (select COUNT(*) from @Membership where ValidToDate < [date]) [ongoing]
from cte
option (maxrecursion 0)

      

+1


source


Note, I think @PittsburghDBA is right when he says the current query is returning the wrong result.
The last day of membership is not counted and therefore the final amount is lower than it should be.
I fixed it in this version.

This should improve your actual progress a bit:

declare @s date = '20160101';
declare @e date = getdate();

with 
x as (
    select d, sum(c) c
    from (
        select ValidFromDateKey d, count(MembershipID) c
        from Memberships
        group by ValidFromDateKey 

        union all

        -- dateadd needed to count last day of membership too!!
        select dateadd(dd, 1, ValidToDateKey) d, -count(MembershipID) c
        from Memberships
        group by ValidToDateKey 
    )x
    group by d
),
c as
(
    select d, sum(x.c) over (order by d) as c
    from x
)
select d.day, c cnt
from calendar d
left join c on d.day = c.d
where d.day between @s and @e
order by d.day;

      

+1


source


First of all, your query gives "1" as MembershipCount

even if there is no active member for the given date.

You must return SUM(CASE WHEN m.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount

.

For optimal performance, create an index on Memberships(ValidFromDateKey, ValidToDateKey, MembershipId)

and another on DIM.[Date](CalendarYear, DateKey)

.

In this case, the optimal query should be:

DECLARE @CalendarYear INT = 2000

SELECT dim.DateKey, SUM(CASE WHEN con.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount
FROM
    DIM.[Date] dim
        LEFT OUTER JOIN (
            SELECT ValidFromDateKey, ValidToDateKey, MembershipID
            FROM Memberships
            WHERE
                    ValidFromDateKey <= CONVERT(DATETIME, CONVERT(VARCHAR, @CalendarYear) + '1231')
                AND ValidToDateKey   >= CONVERT(DATETIME, CONVERT(VARCHAR, @CalendarYear) + '0101')
        ) con
        ON dim.DateKey BETWEEN con.ValidFromDateKey AND con.ValidToDateKey
WHERE dim.CalendarYear = @CalendarYear
GROUP BY dim.DateKey
ORDER BY dim.DateKey

      

Now, for your last question, what would the equivalent equijoin query be.

There is NO WAY , you can rewrite this as not equal!

Equijoin does not imply the use of join

sintax. Equijoin implies the use of a predicate equals

, regardless of sintax.

The query results in a range comparison, so it equals

does not apply: required between

or similar.

-1


source







All Articles