# How the PostgreSQL optimizer handles cross-column correlation

There is often a relationship or **correlation between stored data** in different columns of the same table. For example, in the Customers table, the values in the c_state column depend on the values in the country_id column, since the state XYZ will only be found in country ABC.

I think PostgreSQL assumes that predicates are **independent of each other**, and the selectivities of the predicates on the same relation are **multiplied** together. Consequently, the estimated selectivity will be much smaller than the actual one, and **non-optimal** access paths can be chosen when the data is highly dependent and skewed. How can I avoid this in PostgreSQL?

Is it possible to create **multi-column** statistics **on a column group** in PostgreSQL 9.3.5? Is there support for **multidimensional histograms**?


You're right, Pg assumes independence, and that can be a problem when there is correlation. The analyzer does not know how to find correlations, and the optimizer would not know how to use such information even if the analyzer could find it. There is no support for multivariate histograms or for storing correlation information between columns.
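
The effect is easy to reproduce. The following is a minimal sketch, assuming a hypothetical table `corr_demo` (not from the question) whose two columns always hold the same value, so they are perfectly correlated:

```sql
-- Hypothetical demo table: a and b are always equal, 100 distinct values each.
CREATE TABLE corr_demo AS
SELECT (i % 100) AS a, (i % 100) AS b
FROM generate_series(1, 100000) AS s(i);

-- Collect the ordinary per-column statistics.
ANALYZE corr_demo;

-- Each predicate alone matches about 1% of the rows. Under the independence
-- assumption the planner multiplies the two selectivities (about 0.01 * 0.01),
-- so it expects on the order of 10 rows, while 1000 rows actually match.
EXPLAIN ANALYZE
SELECT * FROM corr_demo WHERE a = 1 AND b = 1;
```

Comparing the estimated `rows=` figure with the actual row count in the `EXPLAIN ANALYZE` output shows how far off the estimate is.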

There has been a lot of discussion about this on the pgsql-hackers and pgsql-general lists, but no firm conclusions on how to deal with it have been reached. Plus, almost no one who has a problem with this is willing to spend the time (or funding) to actually solve it.

Here's a relevant recent article. The Optimizer Tips wiki page also covers some correlation issues.

Some of the mailing list discussions (a highly non-exhaustive list):


# Postgres 10

**Update:** Postgres 10 gains cross-column statistics, a.k.a. extended statistics, a.k.a. correlated statistics.

When searching, sorting, or grouping on columns that are related to each other, such as ZipCode and City, tell Postgres to examine the data so that the planner can make better estimates.

```
CREATE STATISTICS zip_stats_ ( dependencies )
ON zip_ , city_
FROM zipcode_ ;
```
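
Creating the statistics object does not by itself gather any data; that happens the next time the table is analyzed. A minimal sketch of the follow-up steps, assuming the `zipcode_` table above has text columns and using placeholder literal values:

```sql
-- Extended statistics are only populated when the table is analyzed.
ANALYZE zipcode_ ;

-- Confirm that the statistics object exists; stxkeys lists the
-- attribute numbers of the columns it covers.
SELECT stxname, stxkeys
FROM pg_statistic_ext
WHERE stxname = 'zip_stats_' ;

-- Compare the estimated and actual row counts for a query that filters
-- on both correlated columns. The literals are placeholders.
EXPLAIN ANALYZE
SELECT *
FROM zipcode_
WHERE zip_ = '98101' AND city_ = 'Seattle' ;
```

With the `dependencies` statistics in place, the estimate for the combined predicate should move closer to the actual row count than the plain product of the two selectivities.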

Database normalization is about eliminating functional dependencies between values in a row. But sometimes they remain due to de-normalization, either intentionally for performance or unwittingly. And sometimes we have a partial dependency, like ZipCode and City. In such cases, if you are using these correlated values when searching, sorting, or grouping, first create cross-column statistics using the command `CREATE STATISTICS`.

# Syntax

The syntax is simple.

Currently, there are two kinds of statistics: `dependencies`, which measures the degree of functional dependency between the columns (a coefficient of 1.0 means one column fully determines the other), and `ndistinct`, which counts the number of distinct combinations of values across the columns. Omitting the kinds means that you want all available kinds, which may be more than these two in the future.

```
CREATE STATISTICS [ IF NOT EXISTS ] name
[ ( dependencies | ndistinct [, …] ) ]   -- omit to build all kinds
ON col_x , col_y , …
FROM tbl ;
ALTER STATISTICS name OWNER TO …
ALTER STATISTICS name RENAME TO …
ALTER STATISTICS name SET SCHEMA …
DROP STATISTICS [ IF EXISTS ] name
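
As an illustration of the `ndistinct` kind, here is a minimal sketch using a hypothetical table `nd_demo` (not part of the original answer) in which the pair `(a, b)` has far fewer distinct combinations than the per-column counts would suggest:

```sql
-- Hypothetical demo table: b is derived from a, so there are only
-- 100 distinct (a, b) pairs even though each column has many values.
CREATE TABLE nd_demo AS
SELECT (i % 100) AS a, (i % 100) / 2 AS b
FROM generate_series(1, 100000) AS s(i);

-- Ask for the n-distinct statistics on the column pair only.
CREATE STATISTICS nd_demo_stats ( ndistinct )
ON a , b
FROM nd_demo ;

-- Populate both the per-column and the extended statistics.
ANALYZE nd_demo ;

-- The group-count estimate for this query can now use the stored number
-- of distinct (a, b) combinations instead of multiplying the per-column
-- distinct counts.
EXPLAIN
SELECT a, b, count(*)
FROM nd_demo
GROUP BY a, b ;

-- Remove the statistics object when it is no longer needed.
DROP STATISTICS IF EXISTS nd_demo_stats ;
```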

# Further reading

- PG Phriday: Crazy Correlated Column Crusade by Shaun Thomas, 2017-07
- Cross Columns Stats, Postgres wiki
- Manual, Chapter 14.2: Statistics Used by the Planner
- Manual, Chapter 68: How the Planner Uses Statistics
- Wikipedia: Functional dependency
