How to avoid unnecessary sorting in SQL Server GROUP BY?

I have tables of data samples, with a timestamp and some data. Each table has a clustered index on a timestamp followed by a data-specific key. Data samples are not necessarily equidistant.

I need to shrink the data in a specific time range in order to draw plots - say 100,000 rows down to N, where N is about 50. While I might have to compromise the "correctness" of the algorithm from a DSP point of view, I would like to keep this in SQL for productivity reasons.

My current idea is to group the samples in a time range into N buckets and then take the average of each group. One way to achieve this in SQL is to apply a partition function that maps the date to a value from 0 to N-1 (inclusive) and then GROUP BY and AVG.

I think this GROUP BY can be done without sorting, because the date comes from a clustered index and the partition function is monotonic. However, SQL Server doesn't seem to notice this, and it issues a sort that accounts for 78% of the execution cost (in the example below). Assuming I am right and this sort is not needed, I could make the query roughly 5x faster.

Is there a way to force SQL Server to skip sorting? Or is there a better way to approach the problem?

Regards, Ben

IF EXISTS(SELECT name FROM sysobjects WHERE name = N'test') DROP TABLE test

CREATE TABLE test
(
  date DATETIME NOT NULL,
  v FLOAT NOT NULL,
  CONSTRAINT PK_test PRIMARY KEY CLUSTERED (date ASC, v ASC)
)

INSERT INTO test (date, v) VALUES ('2009-08-22 14:06:00.000', 1)
INSERT INTO test (date, v) VALUES ('2009-08-22 17:09:00.000', 8)
INSERT INTO test (date, v) VALUES ('2009-08-24 00:00:00.000', 2)
INSERT INTO test (date, v) VALUES ('2009-08-24 03:00:00.000', 9)
INSERT INTO test (date, v) VALUES ('2009-08-24 14:06:00.000', 7)

-- the lower bound is set to the table min for demo purposes; in reality
-- it could be any date
declare @min float
set @min = cast((select min(date) from test) as float)

-- similarly for max
declare @max float
set @max = cast((select max(date) from test) as float)

-- the number of results to return (assuming enough data is available)
declare @count int
set @count = 3

-- precompute scale factor
declare @scale float
set @scale =  (@count - 1) / (@max - @min)
select @scale

-- this scales the dates from 0 to n-1
select (cast(date as float) - @min) * @scale, v from test

-- this rounds the scaled dates to the nearest partition,
-- groups by the partition, and then averages values in each partition
select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)

      



3 answers


SQL Server does not know that the clustered key date can be used for an expression like round(cast(... as float), 0) to guarantee order. That alone will throw it off track. Add in the (... - @min) * @scale and you have a perfect mess. If you need to sort and group by such expressions, store them in persisted computed columns and index them. You probably want to use DATEPART instead, though, since going through an imprecise type like float will likely make the expression unusable for persisted computed columns.
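As a minimal sketch of that idea, assuming a fixed bucket width is acceptable (a persisted computed column cannot reference variables such as @min or @scale, so the one-hour width below is an arbitrary stand-in for the dynamic partition function in the question):

-- integer, deterministic bucket expression, so the column can be persisted and indexed
ALTER TABLE test ADD bucket AS DATEDIFF(HOUR, 0, date) PERSISTED

-- the nonclustered index also carries the clustered key (date, v), so the query below is covered
CREATE INDEX IX_test_bucket ON test (bucket)

-- grouping on the indexed column can use the index order instead of sorting
SELECT bucket, AVG(v) FROM test GROUP BY bucket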

Update

Regarding the datetime and float equivalence:

declare @f float, @d datetime;
select @d = cast(1 as datetime);
select @f = cast(1 as float);
select cast(@d as varbinary(8)), cast(@f as varbinary(8)), @d, cast(@d as float)

      

Produces the following:



0x0000000100000000  0x3FF0000000000000  1900-01-02 00:00:00.000 1

      

So you can see that while they are both stored in 8 bytes (at least for float(25..53)), the internal representation of datetime is not a float with the integer part being the day and the fractional part being the time (as is often assumed).

To give another example:

declare @d datetime;
select @d = '1900-01-02 12:00 PM';
select cast(@d as varbinary(8)), cast(@d as float)

0x0000000100C5C100  1.5

      

Again, the result of casting @d to float equals 1.5, but the internal representation of the datetime, 0x0000000100C5C100, interpreted as an IEEE double would be 2.1284E-314, not 1.5.



Yes, SQL Server has always had some problems with aggregating SELECTs like this. Analysis Services has many ways of handling it, but the data services side is more limited.

What I suggest you try (I can't test anything from here) is to build a secondary "partition table" that contains your partition definitions and then join against it. You will need suitable indexes for it to work well; see the sketch below:
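A minimal sketch of that idea, reusing @min, @max and @count from the script in the question; the helper table and column names are hypothetical, and the buckets here are equal-width intervals rather than the round()-based partition in the question:

-- hypothetical helper: one row per bucket with its [from_date, to_date) range
declare @buckets table (bucket int primary key, from_date datetime, to_date datetime)

declare @width float, @i int
set @width = (@max - @min) / @count
set @i = 0
while @i < @count
begin
    insert into @buckets (bucket, from_date, to_date)
    values (@i,
            cast(@min + @i * @width as datetime),
            -- widen the last bucket slightly so the sample at exactly @max is kept
            cast(case when @i = @count - 1 then @max + 1
                 else @min + (@i + 1) * @width end as datetime))
    set @i = @i + 1
end

-- each bucket becomes a range predicate on date, which the clustered
-- (date, v) index can satisfy with a seek rather than sorting the whole table
select b.bucket, avg(t.v) as avg_v
from @buckets b
join test t on t.date >= b.from_date and t.date < b.to_date
group by b.bucket
order by b.bucket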



Two questions:

How long does this query actually take?

And are you sure it is sorting on the date? Or where in the plan does it sort on the date? After the partitioning? That would be my guess; I doubt it is the first thing it does... Perhaps the way it partitions or groups needs to be rethought.

Anyway, even if it were sorting an already sorted list, I would not think it would take very long, because it is already sorted...
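If it helps to pin that down, a minimal sketch (appended to the end of the script in the question, so @min and @scale are still in scope) that returns a per-operator breakdown of the actual plan:

-- returns one row per plan operator after executing the statement,
-- showing where the Sort sits and how many rows flow into it
set statistics profile on

select round((cast(date as float) - @min) * @scale, 0), avg(v)
from test
group by round((cast(date as float) - @min) * @scale, 0)

set statistics profile off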







