Time complexity of built-in SQL functions like sum, count, avg
What is the time complexity of a function like count, sum, avg or any of the built-in "math" functions in MySQL, SQL Server, Oracle and others?
You might think that calling sum(myColumn) would be linear. But count(1) is not. How come, and what is the real time complexity?
In a perfect world, I would like sum, avg and count to be O(1). But we don't live in one of those, do we?
In SQL, the mathematical complexity of the aggregate functions is completely irrelevant. The only thing that really matters is the complexity of the data access: which access path is chosen (table scan, index range scan, index seek, etc.) and how many pages are read. There may be slight differences between the aggregates, but they all work pretty much the same way (keep a running state and compute the rolling aggregation for each input value), and no aggregate looks at its input twice, so they are all O(n) in their internal implementation, where "n" is the number of records fed to the aggregate (not necessarily the number of records in the table!).
Some aggregates have internal shortcuts; for example, COUNT(*) can return the count from table metadata on some systems, when possible.
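As a minimal sketch of this (MySQL syntax, against a hypothetical orders table with an index on customer_id; the exact plans vary by engine and version), EXPLAIN makes the access path, and thus "n", visible:

    -- Full table scan: the aggregate sees every row, n = all rows in the table.
    EXPLAIN SELECT SUM(amount) FROM orders;

    -- Index range scan: the aggregate only sees the matching rows,
    -- n = rows for customer 42, however large the table is.
    EXPLAIN SELECT SUM(amount) FROM orders WHERE customer_id = 42;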
- In MySQL with MyISAM, COUNT(*) without GROUP BY is O(1) (constant): the row count is stored in the table metadata.
- On all systems, MAX and MIN on indexed expressions without GROUP BY are O(log(n)) (logarithmic): they are retrieved with a single index lookup.
- Aggregate functions are O(n) (linear) when used without GROUP BY, or when GROUP BY uses HASH.
- Aggregate functions are O(n log(n)) when GROUP BY uses SORT.
All the values must be fetched, computed and accumulated in state variables (which may be stored in a hash table). In addition, when SORT is used, the rows must be sorted first.
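A few examples of the cases above (MySQL syntax, hypothetical tables; whether each shortcut actually fires depends on the storage engine and the available indexes):

    -- MyISAM only: the row count is kept in table metadata, so this is O(1);
    -- EXPLAIN typically reports "Select tables optimized away".
    SELECT COUNT(*) FROM myisam_table;

    -- With an index on amount, MIN/MAX only probe the ends of the index: O(log(n)).
    SELECT MIN(amount), MAX(amount) FROM orders;

    -- One pass over the matching rows: O(n).
    SELECT SUM(amount) FROM orders;

    -- Grouped: O(n) if the planner hashes the groups, O(n log(n)) if it sorts first.
    SELECT customer_id, AVG(amount) FROM orders GROUP BY customer_id;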
source to share
Note: this is an assumption based on my understanding of how SQL query planners work, and it may not be entirely accurate.
I believe that all the aggregate functions, or at least the "math" ones you named above, should be O(n). The query will be executed roughly like this:
- Get the rows matching the join predicates and filter predicates (i.e. the WHERE clause)
- Create row groups according to the GROUP BY clause. A single row group is created for queries without GROUP BY
- For each row group, apply the aggregate function to the rows in the group. For things like SUM, AVG, MIN, MAX, as well as non-numeric functions like CONCAT, there are simple O(n) algorithms, and I suspect those are what is used. Create one row in the output set for each row group created in step 2
- If a HAVING predicate is present, filter the output rows with that predicate
Note, however, that although the aggregate functions are O(n), the operation as a whole may not be. If you write a query that Cartesian-joins a table to itself, you are looking at a minimum of O(n*n) just to create the initial rowset (step 1). Sorting to create the row groups (step 2) can be O(n log n), and it can require disk storage for the sort operation (as opposed to sorting in RAM), so your query may still perform badly if you are manipulating many rows.
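For example (a hypothetical self cross join, not something you would normally write): SUM itself is still linear in its input, but its input is already quadratic in the table size:

    -- Step 1 produces n*n rows before SUM ever runs, so the query is O(n*n) overall.
    SELECT SUM(a.amount * b.amount)
    FROM orders a
    CROSS JOIN orders b;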
For large queries, such as in data warehouses, the underlying database can parallelize the task so that multiple processors work on it. Hence there will be cut-off points where the behavior is not quite linear, as the cost of coordinating the parallel workers is traded off against the benefit of using multiple processors.
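As a sketch (Oracle syntax, hypothetical sales table; the hint only requests parallelism, and the planner still decides whether splitting the scan across workers pays off):

    -- Ask for up to four parallel workers for the scan feeding SUM.
    SELECT /*+ PARALLEL(4) */ SUM(amount) FROM sales;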