Multiple smallest T-SQL sets for common dates including all row IDs

Question

My table (@MyTable) is a list of IDs with start dates and end dates (inclusive), which represent the interval of days on which the ID appears in a file that is received once a day:

ID    Start_Date    End_Date
1     10/01/2014    12/15/2014
2     11/05/2014    03/03/2015
3     12/07/2014    12/09/2014
4     04/01/2015    04/15/2015

Each ID appears only once, i.e. has only one associated time interval, and the intervals between Start_Date and End_Date may (but need not) overlap between different IDs. I need a SQL query to find a set of dates such that each ID appears at least once when the files from those dates are combined, using as few dates as possible. For the table above, the solution could be the following two dates:

File_Date     ID(s)
12/07/2014    1,2,3
04/01/2015    4

But, for example, any one date between ID 3's Start_Date and End_Date, combined with any one date between ID 4's Start_Date and End_Date, would also be a solution.

The actual data consists of 10,000 different identifiers. The date range of possible file dates is 04/01/2014 - 07/01/2015. Each daily file is very large in size and has to be uploaded manually, so I want to minimize the number I have to upload to include all IDs.

So far, I have a CTE that produces a separate row for every date between each ID's Start_Date and End_Date:

;WITH cte (ID, d)
AS
(
    SELECT 
        tbl.ID AS ID,
        tbl.Start_Date AS d
    FROM @MyTable tbl
    UNION ALL
    SELECT 
        tbl.ID AS ID,
        DATEADD(DAY, 1, cte.d) AS d
    FROM cte
    INNER JOIN 
    @MyTable tbl ON cte.ID = tbl.ID
    WHERE cte.d < tbl.End_Date
)
SELECT
    ID AS ID,
    d AS File_Date 
FROM cte
ORDER BY ID,d
OPTION (MaxRecursion 500)

Results for the example @MyTable:

ID    File_Date
1     10/01/2014
1     10/02/2014
1     10/03/2014
1     etc...

My thinking was to determine the most common File_Date among all IDs, then select the next most common File_Date among all remaining IDs, and so on, but I'm stuck. To express this more mathematically, I am trying to find the fewest sets (File_Dates) that contain all elements (IDs), similar to https://softwareengineering.stackexchange.com/questions/263095/finding-the-fewest-sets-which-contain-all-items, but I don't care about minimizing duplicates. The end result does not need to include which IDs appear on which File_Date; I just need to know all the File_Dates.

I am using MS SQL Server 2008.

date sql sql-server tsql common-table-expression

RCheskin Jul 16 '15 at 20:33

2 answers

Using the approach suggested by Vladimir Baranov, with the answer to "In SQL Server, how to create a while loop in a select" as a model:

;WITH cte (ID, d)
AS
(
    SELECT 
        tbl.ID AS ID,
        tbl.Start_Date AS d
    FROM @MyTable tbl
    UNION ALL
    SELECT 
        tbl.ID AS ID,
        DATEADD(DAY, 1, cte.d) AS d
    FROM cte
    INNER JOIN 
    @MyTable tbl ON cte.ID = tbl.ID
    WHERE cte.d < tbl.End_Date
)
SELECT
    ID AS ID,
    d AS File_Date
    into #temp2
FROM cte
ORDER BY ID,d
OPTION (MaxRecursion 500)

Create Table #FileDates
(
File_Date date
)

GO

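DECLARE @VarDate date

-- Repeat until every ID has been covered by a chosen date
WHILE EXISTS (SELECT * FROM #temp2)
BEGIN
    -- Pick the date shared by the most remaining IDs
    SELECT TOP(1)
        @VarDate = File_Date
    FROM #temp2
    GROUP BY File_Date
    ORDER BY COUNT(*) DESC;

    -- Record the chosen date
    INSERT INTO #FileDates (File_Date)
    VALUES (@VarDate)

    -- Remove the chosen date and every row of the IDs it covers
    DELETE FROM #temp2
    WHERE File_Date = @VarDate
    OR ID IN
    (
        SELECT t2.ID
        FROM #temp2 AS t2
        WHERE t2.File_Date = @VarDate
    )
END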

SELECT *
FROM #FileDates
ORDER BY File_Date

It took 30 seconds to return 40 file dates for approximately 4,000 IDs. Many thanks to Mr. Baranov!

RCheskin Jul 17 '15 at 8:18

Vladimir Baranov · Accepted Answer · 2015-07-17T01:58:54+0000

Just continue what you started. The result found by this method is not guaranteed to be optimal, but it may be good enough for your purposes.

For each ID, generate a set of rows, one for each day in its range. You already know how to do this, although I would use a table of numbers instead of generating the rows on the fly with a recursive CTE every time. It doesn't really matter, though.
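A minimal sketch of the numbers-table variant mentioned above, using the sample data from the question. The #Numbers table, its column n, and the use of sys.all_objects as a row source are assumptions made for illustration; 500 values are chosen to match the MaxRecursion 500 limit in the question's CTE.

```sql
-- Sample data from the question.
DECLARE @MyTable TABLE (ID int PRIMARY KEY, Start_Date date, End_Date date);
INSERT INTO @MyTable (ID, Start_Date, End_Date) VALUES
    (1, '20141001', '20141215'),
    (2, '20141105', '20150303'),
    (3, '20141207', '20141209'),
    (4, '20150401', '20150415');

-- Hypothetical numbers table holding 0..499 (built once, reused afterwards).
CREATE TABLE #Numbers (n int NOT NULL PRIMARY KEY);
INSERT INTO #Numbers (n)
SELECT TOP (500) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- One row per ID per day: offsets 0..(End_Date - Start_Date), both ends inclusive.
SELECT t.ID, DATEADD(DAY, num.n, t.Start_Date) AS FileDate
FROM @MyTable AS t
INNER JOIN #Numbers AS num
    ON num.n <= DATEDIFF(DAY, t.Start_Date, t.End_Date)
ORDER BY t.ID, FileDate;
```

Adding an INTO clause to the final SELECT would place the rows in a temporary table instead of returning them.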

Place the result in a temporary table. It will have 10,000 IDs * ~400 days = ~4M rows. The temp table has two columns (ID, FileDate). Create appropriate indexes. I would start with two: on (ID, FileDate) and on (FileDate, ID). Make one of them the clustered primary key. I would try (FileDate, ID) as the clustered primary key first.
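As a rough sketch, the table and the two indexes described above could be declared as follows. The int and date column types and the index name IX_temp_ID_FileDate are assumptions, and the indexes are declared unique because each (ID, FileDate) pair occurs only once.

```sql
-- Temp table with the clustered primary key on (FileDate, ID).
CREATE TABLE #temp
(
    ID int NOT NULL,
    FileDate date NOT NULL,
    PRIMARY KEY CLUSTERED (FileDate, ID)
);

-- Second index with the columns in the opposite order, for lookups by ID.
CREATE UNIQUE NONCLUSTERED INDEX IX_temp_ID_FileDate
    ON #temp (ID, FileDate);
```

Which of the two orderings works better as the clustered key is something to measure on the real data.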

Then process in a loop:

Find the date with the most IDs:

SELECT TOP(1) @VarDate = FileDate
FROM #temp
GROUP BY FileDate
ORDER BY COUNT(*) DESC;

Remember the found date (and possibly its IDs) in another temporary table for the final result.

Remove the date, and all IDs that correspond to this date, from the large table:

DELETE FROM #temp
WHERE FileDate = @VarDate
OR ID IN
(
    SELECT t2.ID
    FROM #temp AS t2
    WHERE t2.FileDate = @VarDate
)

Repeat the loop until there are no rows left in #temp.
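Put together, the steps above might look like the sketch below, which also records the IDs covered by each chosen date. The #Result table, the trimmed sample rows, and the FileDate tiebreaker in the ORDER BY are assumptions added for illustration.

```sql
-- A few (ID, FileDate) rows as they would look after expanding the intervals,
-- trimmed to two dates per ID to keep the sketch short.
CREATE TABLE #temp
(
    ID int NOT NULL,
    FileDate date NOT NULL,
    PRIMARY KEY CLUSTERED (FileDate, ID)
);
INSERT INTO #temp (ID, FileDate) VALUES
    (1, '20141206'), (1, '20141207'),
    (2, '20141207'), (2, '20141208'),
    (3, '20141207'), (3, '20141208'),
    (4, '20150401'), (4, '20150402');

-- Hypothetical result table: one row per chosen date and covered ID.
CREATE TABLE #Result (FileDate date NOT NULL, ID int NOT NULL);

DECLARE @VarDate date;

WHILE EXISTS (SELECT * FROM #temp)
BEGIN
    -- Greedy step: the date that covers the most remaining IDs.
    -- Ties are broken by the earliest date so the choice is deterministic.
    SELECT TOP (1) @VarDate = FileDate
    FROM #temp
    GROUP BY FileDate
    ORDER BY COUNT(*) DESC, FileDate;

    -- Remember the date and the IDs it covers before deleting them.
    INSERT INTO #Result (FileDate, ID)
    SELECT FileDate, ID
    FROM #temp
    WHERE FileDate = @VarDate;

    -- Remove the chosen date and every row of the IDs it covers.
    DELETE FROM #temp
    WHERE FileDate = @VarDate
    OR ID IN
    (
        SELECT t2.ID
        FROM #temp AS t2
        WHERE t2.FileDate = @VarDate
    );
END;

SELECT FileDate, ID
FROM #Result
ORDER BY FileDate, ID;
```

With these sample rows the loop runs twice: it picks 2014-12-07 for IDs 1, 2 and 3, then 2015-04-01 for ID 4.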
