Slow query with distinct / group by on varchar column with Postgres
I have a table company and a table industry, plus a many-to-many link table between the two named company_industry. The table company currently has approximately 750,000 rows.
Now I need a query that finds all the unique city names for a given industry that have at least one company. So basically I have to find all the companies related to a given industry and select the unique city names of those companies.
I can write queries that do this just fine, but not with the performance I'm looking for. I was a bit skeptical about performance beforehand because the column city_name is of type VARCHAR. Unfortunately, I am currently unable to change the database schema to something more normalized.

The first thing I did was add an index on the column city_name, and then I tried the following queries.
SELECT c.city_name AS city
FROM industry AS i
INNER JOIN company_industry AS ci ON (ci.industry_id = i.id)
INNER JOIN company AS c ON (c.id = ci.company_id)
WHERE i.id = 288
GROUP BY city;
The above query takes about two seconds to complete on average. The same thing happens when GROUP BY is replaced with DISTINCT. Below is the execution plan for the above query.
HashAggregate  (cost=56934.21..56961.61 rows=2740 width=9) (actual time=2421.364..2421.921 rows=1962 loops=1)
  ->  Hash Join  (cost=38972.69..56902.50 rows=12687 width=9) (actual time=954.377..2411.194 rows=12401 loops=1)
        Hash Cond: (ci.company_id = c.id)
        ->  Nested Loop  (cost=0.28..13989.91 rows=12687 width=4) (actual time=0.041..203.442 rows=12401 loops=1)
              ->  Index Only Scan using industry_pkey on industry i  (cost=0.28..8.29 rows=1 width=4) (actual time=0.015..0.018 rows=1 loops=1)
                    Index Cond: (id = 288)
                    Heap Fetches: 0
              ->  Seq Scan on company_industry ci  (cost=0.00..13854.75 rows=12687 width=8) (actual time=0.020..199.087 rows=12401 loops=1)
                    Filter: (industry_id = 288)
                    Rows Removed by Filter: 806309
        ->  Hash  (cost=26036.52..26036.52 rows=744152 width=13) (actual time=954.113..954.113 rows=744152 loops=1)
              Buckets: 4096  Batches: 64  Memory Usage: 551kB
              ->  Seq Scan on company c  (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..554.662 rows=744152 loops=1)
Total runtime: 2422.185 ms
I tried changing the query to use a subquery as shown below, which made the query about twice as fast.
SELECT c.city_name
FROM company AS c
WHERE EXISTS (
    SELECT 1
    FROM company_industry
    WHERE industry_id = 288 AND company_id = c.id
)
GROUP BY c.city_name;
And the execution plan for this query:
HashAggregate  (cost=47108.71..47136.11 rows=2740 width=9) (actual time=1270.171..1270.798 rows=1962 loops=1)
  ->  Hash Semi Join  (cost=14015.50..47076.98 rows=12690 width=9) (actual time=194.548..1251.785 rows=12401 loops=1)
        Hash Cond: (c.id = company_industry.company_id)
        ->  Seq Scan on company c  (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..537.856 rows=744152 loops=1)
        ->  Hash  (cost=13856.88..13856.88 rows=12690 width=4) (actual time=194.399..194.399 rows=12401 loops=1)
              Buckets: 2048  Batches: 1  Memory Usage: 436kB
              ->  Seq Scan on company_industry  (cost=0.00..13856.88 rows=12690 width=4) (actual time=0.012..187.449 rows=12401 loops=1)
                    Filter: (industry_id = 288)
                    Rows Removed by Filter: 806309
Total runtime: 1271.030 ms
It's better, but hopefully you can help me do better still.

Basically, the expensive part of the query seems to be finding the unique city names (as expected), and even with the index on the column, the performance isn't exactly good. I'm pretty rusty when it comes to analyzing execution plans, but I've included them so you can see exactly what's going on.

What can I do to retrieve this data faster?
I am using Postgres 9.3.5; the DDL is below:

CREATE TABLE company (
    id SERIAL PRIMARY KEY NOT NULL,
    name VARCHAR(150) NOT NULL,
    city_name VARCHAR(50)
);

CREATE TABLE company_industry (
    company_id INT NOT NULL REFERENCES company (id) ON UPDATE CASCADE,
    industry_id INT NOT NULL REFERENCES industry (id) ON UPDATE CASCADE,
    PRIMARY KEY (company_id, industry_id)
);

CREATE TABLE industry (
    id SERIAL PRIMARY KEY NOT NULL,
    name VARCHAR(100) NOT NULL
);
CREATE INDEX company_city_name_index ON company (city_name);
Both query plans have a Seq Scan on company_industry, which should really be a (bitmap) index scan. The same applies to the Seq Scan on company.

So it seems to be a problem with missing indexes, or something is wrong in your DB. If something seems to be wrong in your DB, take a backup before proceeding. Then check whether the cost settings and table statistics are actually sane:
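A quick sanity check might look like the following; these are standard Postgres parameters and catalog views, so only the table names are specific to this question:

SHOW random_page_cost;        -- default 4.0; often lowered on SSD storage
SHOW effective_cache_size;    -- should reflect the RAM available for caching
SHOW work_mem;                -- too small a value forces many hash batches, as in the first plan

-- When were the tables last (auto-)analyzed?
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname IN ('company', 'company_industry');

-- Refresh statistics if they look stale:
ANALYZE company;
ANALYZE company_industry;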
If the settings are good, I would then check the relevant indexes (as described below). Perhaps a simple

REINDEX TABLE company;
REINDEX TABLE company_industry;

will fix it; maybe you need to do more.
Alternatively, you can simplify your query:
SELECT c.city_name AS city
FROM company_industry ci
JOIN company c ON c.id = ci.company_id
WHERE ci.industry_id = 288
GROUP BY 1;
Notes

- If your PK constraint is on (company_id, industry_id), add another (unique) index on (industry_id, company_id) (reverse order!). Why? A multi-column B-tree index can only be used efficiently when the query filters on its leading column(s), and this query filters on industry_id.

- The Seq Scan on company is equally annoying. There seems to be no index on company(id), but your DDL declares it the PK, so how can that be? The fastest option would be a multi-column index on (id, city_name), but only if you get index-only scans out of it. (Both suggested indexes are sketched after these notes.)

- Since you already have the industry ID, you don't need to include the table industry at all.

- There is no need for parentheses around the expression(s) in the ON clause.

- This is unfortunate:

  "Unfortunately, I am currently unable to change the database schema to something more normalized."

  Your simple schema makes sense for small tables with little redundancy, which are unlikely to strain the available cache. But city names are probably highly redundant in big tables. Normalizing would shrink the tables and indexes considerably, which is the most important factor for performance.

  A denormalized form with redundant storage can sometimes buy performance for targeted queries, and sometimes not. But it always affects everything else: the redundant storage eats more of your available cache, so other data gets evicted sooner. Even if you gain something locally, you can lose overall.

  In this particular case it would also be considerably cheaper to get distinct values for a column city_id int, since integer values are smaller and faster to compare than (potentially long) strings. A multi-column index on (id, city_id) in company would be smaller than the same on (id, city_name) and faster to process. One more join after folding many duplicates is comparatively cheap. (A sketch of this normalization also follows after these notes.)

  If you need top performance, you can always add a MATERIALIZED VIEW for this special purpose, with pre-computed results (readily aggregated and with an index on industry_id), rather than storing massively redundant information in your base tables. (Also sketched below.)
If you need this query in the range of milliseconds, you have to denormalize your data by adding another city_name column to the company_industry junction table and indexing it, as sketched below.
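A rough sketch of that denormalization; the backfill and index name are my own, and the triggers needed to keep the copied column in sync are omitted:

ALTER TABLE company_industry ADD COLUMN city_name VARCHAR(50);

UPDATE company_industry ci
SET    city_name = c.city_name
FROM   company c
WHERE  c.id = ci.company_id;

CREATE INDEX company_industry_industry_id_city_name_idx
    ON company_industry (industry_id, city_name);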
This way you will only query one table, with no joins:
SELECT DISTINCT ci.city_name
FROM company_industry ci
WHERE ci.industry_id = 288;