How does the PostgreSQL optimizer handle cross-column correlation?

There is often a relationship or correlation between data stored in different columns of the same table. For example, in a Customers table, the values in the c_state column depend on the values in the country_id column, since state XYZ will only be found in country ABC.

I think PostgreSQL assumes that predicates are independent of each other, so the selectivities of the individual predicates on the same relation are multiplied together. Consequently, the selectivity estimates can be much smaller than the actual values, and non-optimal access paths can be chosen when the data is highly dependent and skewed. How can I avoid this in PostgreSQL?

Is it possible to create multi-column statistics on a column group in PostgreSQL 9.3.5? Is there support for multidimensional histograms?



2 answers


You're right, Pg assumes independence, and that can be a problem when there is correlation. The analyzer does not know how to find correlations, and the optimizer would not know how to use such information even if the analyzer could find it. There is no support for multivariate histograms or for storing correlation information between columns.
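To make the independence assumption concrete, here is a small Python sketch of the arithmetic. It is not PostgreSQL code; the rows and the country/state names are hypothetical toy data, chosen so that every state belongs to exactly one country.

```python
# Toy illustration of the independence assumption; this is not PostgreSQL code.
# Hypothetical data: each row is (country_id, c_state), and every state
# belongs to exactly one country.
rows = (
    [("ABC", "XYZ")] * 50      # 50 customers in country ABC, state XYZ
    + [("ABC", "QRS")] * 50    # 50 customers in country ABC, state QRS
    + [("DEF", "TUV")] * 100   # 100 customers in country DEF, state TUV
)


def selectivity(predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


sel_country = selectivity(lambda r: r[0] == "ABC")  # 100 of 200 rows
sel_state = selectivity(lambda r: r[1] == "XYZ")    # 50 of 200 rows

# Independence assumption: multiply the per-column selectivities.
estimated = sel_country * sel_state

# Actual selectivity of the combined predicate.
actual = selectivity(lambda r: r[0] == "ABC" and r[1] == "XYZ")

estimated_rows = estimated * len(rows)
actual_rows = actual * len(rows)
print(f"estimated rows: {estimated_rows:.0f}, actual rows: {actual_rows:.0f}")
```

Because the state already implies the country, the country predicate filters nothing further, yet multiplying the two selectivities halves the estimate (25 rows instead of the real 50). With more columns or more skew, the gap grows.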

There has been a lot of discussion about this on the pgsql-hackers and pgsql-general mailing lists, but no firm conclusions on how to deal with it have been reached. Also, almost no one who has this problem is willing to spend the time (or funding) to actually solve it.

Here's a fairly recent article. The Optimizer Tips wiki page also covers some correlation issues.



Some of the mailing list discussions (a highly non-exhaustive list):



Postgres 10

Update: Postgres 10 gained cross-column statistics, also known as extended statistics or correlated statistics.

When searching, sorting, or grouping on columns that are related to each other, such as ZipCode and City, tell Postgres to examine the relationship between them so that the planner can make better estimates.

CREATE STATISTICS zip_stats_ ( dependencies )
ON zip_ , city_
FROM zipcode_ ;

-- The statistics are only gathered the next time the table is analyzed.
ANALYZE zipcode_ ;

      

Database normalization is about eliminating functional dependencies between values in a row. But sometimes they remain due to denormalization, either intentionally for performance or unwittingly. And sometimes we have a partial dependency, as with ZipCode and City. In such cases, if you use these correlated values when searching, sorting, or grouping, first create cross-column statistics with the CREATE STATISTICS command.

Syntax



The syntax is simple.

Currently there are two kinds of statistics: dependencies, which measures the degree of functional dependency between columns (1.0 means one column fully determines another), and ndistinct, which counts the distinct combinations of values across the columns. Omitting the kinds means you want all available kinds, which may grow beyond these two in future versions.
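As a rough illustration of what these two kinds measure, here is a simplified Python sketch. It is not PostgreSQL's actual algorithm (which works on a sample of the table and differs in detail); the (zip, city) rows and the dependency_degree helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical (zip, city) rows: each zip maps to one city,
# but a city spans several zips.
rows = [
    ("10001", "New York"), ("10001", "New York"),
    ("10002", "New York"),
    ("94103", "San Francisco"), ("94103", "San Francisco"),
    ("94105", "San Francisco"),
]


def dependency_degree(rows, src, dst):
    """Fraction of rows whose src value maps to exactly one dst value."""
    dst_values = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        dst_values[row[src]].add(row[dst])
        counts[row[src]] += 1
    consistent = sum(n for key, n in counts.items() if len(dst_values[key]) == 1)
    return consistent / len(rows)


zip_to_city = dependency_degree(rows, 0, 1)  # zip determines city
city_to_zip = dependency_degree(rows, 1, 0)  # city does not determine zip

# "ndistinct" idea: distinct (zip, city) combinations actually present,
# versus the product of per-column distinct counts that independence implies.
ndistinct_pairs = len(set(rows))
ndistinct_product = len({r[0] for r in rows}) * len({r[1] for r in rows})

print(zip_to_city, city_to_zip, ndistinct_pairs, ndistinct_product)
```

Here zip determines city completely (degree 1.0) while city says nothing definite about zip (degree 0.0), and only 4 combinations exist where independence would suggest up to 8. This is the kind of information the planner can use to correct its estimates.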

CREATE STATISTICS [ IF NOT EXISTS ] name
[ ( kind [, …] ) ]   -- dependencies and/or ndistinct ; omit for all kinds
ON col_x , col_y , …
FROM tbl ;

ALTER STATISTICS name OWNER TO …
ALTER STATISTICS name RENAME TO …
ALTER STATISTICS name SET SCHEMA …

DROP STATISTICS [ IF EXISTS ] name

      
