Time complexity of built-in SQL functions like SUM, COUNT, AVG

What is the time complexity of a function like COUNT, SUM, AVG or any of the other built-in "math" functions in MySQL, SQL Server, Oracle and others?

You might think that calling SUM(myColumn) would be linear.

But COUNT(1) apparently is not. Why, and what is the real time complexity?

In a perfect world I would like SUM, AVG and COUNT to be O(1). But we don't live in one, do we?
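
For concreteness, these are the kinds of queries in question (the table and column names are just placeholders):

    -- Is this O(1) or O(n)?
    SELECT COUNT(1) FROM myTable;

    -- Presumably O(n)?
    SELECT SUM(myColumn) FROM myTable;
    SELECT AVG(myColumn) FROM myTable;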

+2


source


4 answers


In SQL, the mathematical complexity of the aggregate functions is completely irrelevant. The only thing that really matters is the complexity of the data access: which access path is chosen (table scan, index range scan, index seek, etc.) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep a running state and compute the rolling aggregation for each input value), and there is absolutely no aggregate that looks at its input twice, so they are all O(n) as internal implementations, where "n" is the number of records fed to the aggregate (not necessarily the number of records in the table!).
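
To illustrate why the access path dominates, compare the following two queries. This is only a sketch (the table, column and index names are invented, and the actual plan depends on the engine and its statistics), but it shows how the same SUM aggregate can be fed very different numbers of records:

    -- Assume an index exists on t.created_at.

    -- Table scan: n = every row in the table.
    SELECT SUM(amount) FROM t;

    -- Index range scan: n = only the rows in the range,
    -- which may be far smaller than the table.
    SELECT SUM(amount)
    FROM t
    WHERE created_at >= '2010-01-01'
      AND created_at <  '2010-02-01';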



Some aggregates have internal shortcuts; for example, COUNT(*) can return the count from table metadata on some systems, when possible.
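
For example, on MySQL with MyISAM (a sketch; the table name is illustrative and the behavior differs by storage engine):

    -- MyISAM keeps the exact row count in table metadata,
    -- so no rows need to be read: O(1).
    SELECT COUNT(*) FROM t;

    -- A filter defeats the metadata shortcut; rows (or index
    -- entries) must actually be examined: O(n) in matching rows.
    SELECT COUNT(*) FROM t WHERE status = 'active';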

+2


source


What is the time complexity of a function like COUNT, SUM, AVG or any of the other built-in "math" functions in MySQL, SQL Server, Oracle and others?

  • In MySQL with MyISAM, COUNT(*) without GROUP BY is O(1) (constant). It is stored in the table metadata.
  • On all systems, MAX and MIN on indexed expressions without GROUP BY are O(log(n)) (logarithmic). They are retrieved with a single index seek.
  • Aggregate functions are O(n) (linear) when used without GROUP BY, or when GROUP BY uses HASH.
  • Aggregate functions are O(n log(n)) when GROUP BY uses SORT. (Each of these cases is sketched below.)
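
A sketch of each case (the table, column and index names are invented; the O(log(n)) case assumes a B-tree index on the column):

    -- O(1) on MyISAM: the count comes from table metadata.
    SELECT COUNT(*) FROM t;

    -- O(log(n)) with a B-tree index on price: a single index
    -- seek to the end of the index returns the answer.
    SELECT MAX(price) FROM t;

    -- O(n): every row must be read and accumulated.
    SELECT SUM(price) FROM t;

    -- O(n) if the planner groups with a HASH, O(n log(n))
    -- if it groups with a SORT.
    SELECT customer_id, SUM(price)
    FROM t
    GROUP BY customer_id;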



All values must be retrieved, computed, and kept in state variables (which may be stored in a hash table).

In addition, when SORT is used, the values must be sorted first.

+3


source


Note: this is an assumption based on my understanding of how SQL query planners work, and may not be entirely accurate.

I believe that all of the aggregate functions, or at least the "math" ones you named above, should be O(n). The query will be executed something like this (the steps are labeled in the annotated example after the list):

  • Fetch the rows matching the join predicates and filter predicates (i.e. the WHERE clause)
  • Create row groups according to the GROUP BY clause. A single row group is created for queries without a GROUP BY
  • For each row group, apply the aggregate function to the rows in the group. For things like SUM, AVG, MIN, MAX, as well as non-numeric functions like CONCAT, there are simple O(n) algorithms, and I suspect those are used. Create one row in the output set for each row group created in step #2
  • If a HAVING predicate is present, filter the output rows using that predicate
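
Here is an annotated query that exercises each step (the schema is invented for illustration):

    SELECT   d.name, AVG(e.salary) AS avg_salary  -- step #3: aggregate per group
    FROM     employees e
    JOIN     departments d ON d.id = e.dept_id    -- step #1: join predicate
    WHERE    e.hired_on >= '2009-01-01'           -- step #1: filter predicate
    GROUP BY d.name                               -- step #2: form row groups
    HAVING   AVG(e.salary) > 50000;               -- step #4: filter output rows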

Note, however, that even though the aggregate functions are O(n), the operation as a whole may not be. If you create a query that Cartesian-joins a table to itself, you are looking at a minimum of O(n * n) just to create the initial rowset (step #1). Sorting to create the row groups (step #2) can be O(n lg n), and may require disk storage for the sort operation (as opposed to an in-RAM operation), so your query may still perform poorly if you are manipulating many rows.
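
For instance, a self cross join like the following (the table name is illustrative) builds an O(n * n) rowset before the aggregate ever runs, even though SUM itself remains linear in its input:

    -- Every row of t is paired with every row of t,
    -- so n * n rows reach the aggregate.
    SELECT SUM(a.amount + b.amount)
    FROM t AS a
    CROSS JOIN t AS b;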

+1


source


For very large queries, such as those in data warehouses, the underlying database can parallelize the task so that multiple CPUs work on it. There will therefore be cut-off points where the behavior is not quite linear, as the cost of coordinating the parallel threads is traded against the benefit of using multiple CPUs.
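
On SQL Server, for example, the degree of parallelism can be capped per query with the MAXDOP hint. A sketch (the table name is invented, and the cut-off point where parallelism pays off depends on the engine and the workload):

    -- Allow up to 4 CPUs for this aggregate. Below some input
    -- size, the coordination overhead outweighs the benefit and
    -- a serial plan would be cheaper.
    SELECT SUM(amount)
    FROM big_fact_table
    OPTION (MAXDOP 4);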

0


source






