SQL Server 2012 buckets based on grand total

Question

SQL Server 2012 buckets based on grand total

For SQL Server 2012, I am trying to assign given rows to sequential buckets based on the maximum bucket size (100 in the example below) and the entire column execution. Most of the solutions I have found split on the known value changing the column value, eg. section by department ID, etc. However, in this situation, all I have is a sequential ID and size. The closest solution I've found is discussed in this thread for SQL Server 2008 and I've tried it, but performance is very slow for large rows, much worse than cursor based solution. https://dba.stackexchange.com/questions/45179/how-can-i-write-windowing-query-which-sums-a-column-to-create-discrete-buckets

This table can contain up to 10 million rows. SQL Server 2012 supports SUM OVER and LAG and LEAD features, wondering if anyone could suggest a 2012 based solution.

CREATE TABLE raw_data (
id    INT PRIMARY KEY
, size  INT NOT NULL
);

INSERT INTO raw_data
(id, size)
VALUES 
  ( 1,    96) -- new bucket here, maximum bucket size is 100
, ( 2,    10) -- and here
, ( 3,    98) -- and here
, ( 4,    20)
, ( 5,    50)
, ( 6,    15)
, ( 7,    97)
, ( 8,    96) -- and here
;

--Expected output
--bucket_size is for illustration only, actual needed output is bucket only

id  size    bucket_size bucket
-----------------------------
1   100     100         1
2    10     10          2
3    98     98          3
4    20     85          4
5    50     85          4
6    15     85          4
7    97     98          5
8    1      98          5

TIA

+3

sql-server-2012 sum running-total

Emerald77 13 Aug 14 at 19:29

source to share

2 answers

Andriy M · Answer 1 · 2014-08-15T19:12:46+0000

Before you can use the current method to assign bucket numbers, you need to generate this column bucket_size

, because numbers will be generated based on this column.

Depending on your expected result, the bucket ranges

1..10
11..85
86..100

You can use a simple CASE expression like this to create a column bucket_size

like in your example:

CASE
    WHEN size <= 10 THEN 10
    WHEN size <= 85 THEN 85
    ELSE 100
END

Then you have to use LAG () to determine if the line starts a new sequence of sizes belonging to the same bucket:

CASE bucket_size
    WHEN LAG(bucket_size) OVER (ORDER BY id) THEN 0
    ELSE 1
END

These two calculations can be done in the same (sub) query using CROSS APPLY:

SELECT
    d.id,
    d.size,
    x.bucket_size,  -- for illustration only
    is_new_seq = CASE x.bucket_size
                     WHEN LAG(x.bucket_size) OVER (ORDER BY d.id) THEN 0
                     ELSE 1
                 END
FROM dbo.raw_data AS d
CROSS APPLY
(
    SELECT
        CASE
            WHEN size <= 10 THEN 10
            WHEN size <= 85 THEN 85
            ELSE 100
        END
) AS x (bucket_size)

The above query will produce this output:

id  size  bucket_size  is_new_seq
--  ----  -----------  ----------
1   96    100          1
2   10    10           1
3   98    100          1
4   20    85           1
5   50    85           0
6   15    85           0
7   97    100          1
8   96    100          0

Now use this result as a derived table and apply SUM () OVER to is_new_seq

to create bucket numbers like this :

SELECT
    id,
    size,
    bucket = SUM(is_new_seq) OVER (ORDER BY id)
FROM
(
    SELECT
        d.id,
        d.size,
        is_new_seq = CASE x.bucket_size
                         WHEN LAG(x.bucket_size) OVER (ORDER BY d.id) THEN 0
                         ELSE 1
                     END
    FROM dbo.raw_data AS d
    CROSS APPLY
    (
        SELECT
            CASE
                WHEN size <= 10 THEN 10
                WHEN size <= 85 THEN 85
                ELSE 100
            END
    ) AS x (bucket_size)
) AS s
;

SQLDiver · Answer 2 · 2015-04-12T16:06:01+0000

You can achieve this easily in SQL Server 2012 with windowing and cropping. The syntax looks pretty complicated, but the concept is simple - sum all the previous lines up to and including the current one. The cumulative_bucket_size column in this example is for demonstration purposes, as it is part of the equation used to get the bucket number:

DECLARE @Bucket_Size AS INT;
SET @Bucket_Size = 100

SELECT
    id,
    size,
    SUM(size) OVER  (
                        PARTITION BY 1 ORDER BY id ASC
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                    ) AS cumulative_bucket_size,
    1 + SUM(size) OVER  (
                        PARTITION BY 1 ORDER BY id ASC
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                    ) / @Bucket_Size AS bucket
FROM
    raw_data

The PARTITION BY clause is optional, but it would be helpful if you had different "bucket sets" for grouping columns. I've added it here for completeness.

Results:

id  size    cumulative_bucket_size  bucket
------------------------------------------
1   96      96                      1
2   10      106                     2
3   98      204                     3
4   20      224                     3
5   50      274                     3
6   15      289                     3
7   97      386                     4
8   96      482                     5

You can read more about cropping windows in the following article:

https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/

SQL Server 2012 buckets based on grand total

More articles: