SQL Server 2012 buckets based on grand total
For SQL Server 2012, I am trying to assign given rows to sequential buckets based on the maximum bucket size (100 in the example below) and the entire column execution. Most of the solutions I have found split on the known value changing the column value, eg. section by department ID, etc. However, in this situation, all I have is a sequential ID and size. The closest solution I've found is discussed in this thread for SQL Server 2008 and I've tried it, but performance is very slow for large rows, much worse than cursor based solution. https://dba.stackexchange.com/questions/45179/how-can-i-write-windowing-query-which-sums-a-column-to-create-discrete-buckets
This table can contain up to 10 million rows. SQL Server 2012 supports SUM OVER and LAG and LEAD features, wondering if anyone could suggest a 2012 based solution.
CREATE TABLE raw_data (
id INT PRIMARY KEY
, size INT NOT NULL
);
INSERT INTO raw_data
(id, size)
VALUES
( 1, 96) -- new bucket here, maximum bucket size is 100
, ( 2, 10) -- and here
, ( 3, 98) -- and here
, ( 4, 20)
, ( 5, 50)
, ( 6, 15)
, ( 7, 97)
, ( 8, 96) -- and here
;
--Expected output
--bucket_size is for illustration only, actual needed output is bucket only
id size bucket_size bucket
-----------------------------
1 100 100 1
2 10 10 2
3 98 98 3
4 20 85 4
5 50 85 4
6 15 85 4
7 97 98 5
8 1 98 5
TIA
source to share
Before you can use the current method to assign bucket numbers, you need to generate this column bucket_size
, because numbers will be generated based on this column.
Depending on your expected result, the bucket ranges
1..10
11..85
86..100
You can use a simple CASE expression like this to create a column bucket_size
like in your example:
CASE
WHEN size <= 10 THEN 10
WHEN size <= 85 THEN 85
ELSE 100
END
Then you have to use LAG () to determine if the line starts a new sequence of sizes belonging to the same bucket:
CASE bucket_size
WHEN LAG(bucket_size) OVER (ORDER BY id) THEN 0
ELSE 1
END
These two calculations can be done in the same (sub) query using CROSS APPLY:
SELECT
d.id,
d.size,
x.bucket_size, -- for illustration only
is_new_seq = CASE x.bucket_size
WHEN LAG(x.bucket_size) OVER (ORDER BY d.id) THEN 0
ELSE 1
END
FROM dbo.raw_data AS d
CROSS APPLY
(
SELECT
CASE
WHEN size <= 10 THEN 10
WHEN size <= 85 THEN 85
ELSE 100
END
) AS x (bucket_size)
The above query will produce this output:
id size bucket_size is_new_seq
-- ---- ----------- ----------
1 96 100 1
2 10 10 1
3 98 100 1
4 20 85 1
5 50 85 0
6 15 85 0
7 97 100 1
8 96 100 0
Now use this result as a derived table and apply SUM () OVER to is_new_seq
to create bucket numbers like this :
SELECT
id,
size,
bucket = SUM(is_new_seq) OVER (ORDER BY id)
FROM
(
SELECT
d.id,
d.size,
is_new_seq = CASE x.bucket_size
WHEN LAG(x.bucket_size) OVER (ORDER BY d.id) THEN 0
ELSE 1
END
FROM dbo.raw_data AS d
CROSS APPLY
(
SELECT
CASE
WHEN size <= 10 THEN 10
WHEN size <= 85 THEN 85
ELSE 100
END
) AS x (bucket_size)
) AS s
;
source to share
You can achieve this easily in SQL Server 2012 with windowing and cropping. The syntax looks pretty complicated, but the concept is simple - sum all the previous lines up to and including the current one. The cumulative_bucket_size column in this example is for demonstration purposes, as it is part of the equation used to get the bucket number:
DECLARE @Bucket_Size AS INT;
SET @Bucket_Size = 100
SELECT
id,
size,
SUM(size) OVER (
PARTITION BY 1 ORDER BY id ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_bucket_size,
1 + SUM(size) OVER (
PARTITION BY 1 ORDER BY id ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / @Bucket_Size AS bucket
FROM
raw_data
The PARTITION BY clause is optional, but it would be helpful if you had different "bucket sets" for grouping columns. I've added it here for completeness.
Results:
id size cumulative_bucket_size bucket
------------------------------------------
1 96 96 1
2 10 106 2
3 98 204 3
4 20 224 3
5 50 274 3
6 15 289 3
7 97 386 4
8 96 482 5
You can read more about cropping windows in the following article:
https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/
source to share