How column cross-correlation is handled by PostgreSQL Optimizer

Question

There is often a relationship or correlation between data stored in different columns of the same table. For example, in a Customers table, the values in the c_state column depend on the values in the country_id column, since state XYZ will only be found in country ABC.

I think PostgreSQL assumes that predicates are independent of each other, and that the selectivities of the predicates on the same relation are multiplied together. Consequently, the selectivity estimates can be much smaller than the actual ones, and non-optimal access paths can be chosen when the data is highly dependent and skewed. How can I avoid this in PostgreSQL?

Is it possible to create multi-column statistics on a column group in PostgreSQL 9.3.5? Is there support for multidimensional histograms?

sql database postgresql postgresql-9.3

CRM 12 Sep At 16:01

2 answers

Craig Ringer · Answer 1 · 2014-09-12T16:31:23+0000

You're right, Pg assumes independence, and that can be a problem when there is correlation. The analyzer does not know how to look for correlations, and the optimizer would not know how to use such information even if the analyzer could find it. There is no support for multivariate histograms or for storing correlation information between columns.
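
A minimal sketch of the misestimate, assuming a hypothetical table customers_demo in which c_state is fully determined by country_id. The table name, its contents, and the row counts are illustrative and not taken from the question.

```sql
-- Hypothetical demo table: c_state always equals country_id,
-- so the two columns are perfectly correlated.
CREATE TABLE customers_demo (country_id int, c_state int);

INSERT INTO customers_demo
SELECT i % 100, i % 100
FROM generate_series(1, 100000) AS s(i);

ANALYZE customers_demo;

-- Each predicate alone matches about 1% of the rows. Under the
-- independence assumption the planner multiplies the selectivities
-- (0.01 * 0.01) and expects roughly 10 rows, while the query
-- actually returns 1000.
EXPLAIN ANALYZE
SELECT * FROM customers_demo
WHERE country_id = 1 AND c_state = 1;
```

Comparing the estimated rows figure with the actual row count in the plan output shows the size of the gap.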

There has been a lot of discussion about this on the pgsql-hackers and pgsql-general lists, but no firm conclusions on how to deal with it have been reached. Plus, almost no one who has a problem with this is willing to spend the time (or funding) to actually solve the problem.

Here's a relevant recent article. The Optimizer Tips wiki page also covers some correlation issues.

Some of the mailing list discussions (a highly non-exhaustive list):

how to implement selectivity in postgresql - recent
Cross-column correlation
How to specify / mock table statistics in PostgreSQL

Basil Bourque · Answer 2 · 2017-07-25T01:04:22+0000

Postgres 10

Update: Postgres 10 gains cross-column statistics, also known as extended statistics or correlated statistics.

When searching, sorting, or grouping on columns that are related to each other, such as ZipCode and City, tell Postgres to examine the data so that the planner can make better estimates.

CREATE STATISTICS zip_stats_ ( dependencies )
ON zip_ , city_
FROM zipcode_ ;
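
CREATE STATISTICS only defines the statistics object; the data is gathered the next time the table is analyzed. A short sketch of the follow-up steps, reusing the zipcode_ table and zip_stats_ name from the example above and assuming the table already exists:

```sql
-- Collect the extended statistics defined above.
ANALYZE zipcode_ ;

-- Confirm that the statistics object is registered in the catalog.
-- stxkeys lists the attribute numbers of the covered columns.
SELECT stxname , stxkeys
FROM pg_statistic_ext
WHERE stxname = 'zip_stats_' ;

-- Re-check the plan for a query that filters on both columns.
-- The literal values here are placeholders.
EXPLAIN
SELECT * FROM zipcode_
WHERE zip_ = '98101' AND city_ = 'Seattle' ;
```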

Database normalization is about eliminating functional dependencies between values in a row. But sometimes they remain due to denormalization, either intentionally for performance or unwittingly. And sometimes we have a partial dependency, as with ZipCode and City. In such cases, if you use these correlated values when searching, sorting, or grouping, first create cross-column statistics with the CREATE STATISTICS command.

Syntax

The syntax is simple.

Currently, there are two kinds of statistics: dependencies, which computes functional-dependency coefficients between the columns (where 1.0 means one column is fully determined by another), and ndistinct, which counts the distinct combinations of values across the columns. Omitting the kind means that you want all available kinds, which may be more than these two in the future.

CREATE STATISTICS [ IF NOT EXISTS ] name
[ ( kind [, …] ) ]   -- kind is dependencies or ndistinct; omit for all kinds
ON col_x , col_y , …
FROM tbl ;

ALTER STATISTICS name OWNER TO …
ALTER STATISTICS name RENAME TO …
ALTER STATISTICS name SET SCHEMA …

DROP STATISTICS [ IF EXISTS ] name
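
As a rough illustration of these commands together, the sketch below creates an ndistinct statistics object for a GROUP BY on both columns, then renames and drops it. The statistics names zip_city_nd_ and zip_city_ndistinct_ are hypothetical; zipcode_, zip_, and city_ come from the earlier example, and the table is assumed to exist.

```sql
-- Define n-distinct statistics on the column pair.
CREATE STATISTICS IF NOT EXISTS zip_city_nd_ ( ndistinct )
ON zip_ , city_
FROM zipcode_ ;

-- Statistics are only populated when the table is analyzed.
ANALYZE zipcode_ ;

-- The estimated number of groups can now draw on the
-- combined distinct count of (zip_, city_).
EXPLAIN
SELECT zip_ , city_ , count(*)
FROM zipcode_
GROUP BY zip_ , city_ ;

-- Housekeeping: rename, then remove the statistics object.
ALTER STATISTICS zip_city_nd_ RENAME TO zip_city_ndistinct_ ;
DROP STATISTICS IF EXISTS zip_city_ndistinct_ ;
```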
