Relevant Postgres index for sorting and joining
I have a simple schema and query, but I am seeing consistently terrible performance with certain parameters.
Schema
CREATE TABLE locations (
id integer NOT NULL,
barcode_id integer NOT NULL
);
CREATE TABLE barcodes (
id integer NOT NULL,
value citext NOT NULL
);
ALTER TABLE ONLY locations ADD CONSTRAINT locations_pkey PRIMARY KEY (id);
ALTER TABLE ONLY barcodes ADD CONSTRAINT barcodes_pkey PRIMARY KEY (id);
ALTER TABLE ONLY locations ADD CONSTRAINT fk_locations_barcodes FOREIGN KEY (barcode_id) REFERENCES barcodes(id);
CREATE INDEX index_barcodes_on_value ON barcodes (value);
CREATE INDEX index_locations_on_barcode_id ON locations (barcode_id);
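The value column uses citext, which ships as an extension, so reproducing the schema may first require:
CREATE EXTENSION IF NOT EXISTS citext;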
Query
EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;
Analysis:
Limit (cost=0.71..3564.01 rows=50 width=34) (actual time=0.043..683.025 rows=50 loops=1)
-> Nested Loop (cost=0.71..4090955.00 rows=57404 width=34) (actual time=0.043..683.017 rows=50 loops=1)
-> Index Scan using index_barcodes_on_value on barcodes (cost=0.42..26865.99 rows=496422 width=15) (actual time=0.023..218.775 rows=372138 loops=1)
-> Index Scan using index_locations_on_barcode_id on locations (cost=0.29..5.32 rows=287 width=8) (actual time=0.001..0.001 rows=0 loops=372138)
Index Cond: (barcode_id = barcodes.id)
Planning time: 0.167 ms
Execution time: 683.078 ms
500+ ms for the number of records in my schema (500,000 barcodes and 60,000 locations) doesn't make sense. Is there anything I can do to improve performance?
Note:
Even stranger, the execution time depends on the data. While preparing this question I tried to reproduce the problem with random data, but the seeded data performs well:
Seed:
INSERT INTO barcodes (id, value) SELECT seed.id, gen_random_uuid() FROM generate_series(1,500000) AS seed(id);
INSERT INTO locations (id, barcode_id) SELECT seed.id, (floor(RANDOM() * 500000)::int + 1) FROM generate_series(1,60000) AS seed(id); -- +1 keeps barcode_id within 1..500000, so the foreign key always holds
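Note that gen_random_uuid() is only built in from PostgreSQL 13 on; on older versions the seed assumes the pgcrypto extension:
CREATE EXTENSION IF NOT EXISTS pgcrypto;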
Analysis:
Limit (cost=0.71..3602.63 rows=50 width=86) (actual time=0.089..1.123 rows=50 loops=1)
-> Nested Loop (cost=0.71..4330662.42 rows=60116 width=86) (actual time=0.088..1.115 rows=50 loops=1)
-> Index Scan using index_barcodes_on_value on barcodes (cost=0.42..44972.42 rows=500000 width=41) (actual time=0.006..0.319 rows=376 loops=1)
-> Index Scan using index_locations_on_barcode_id on locations (cost=0.29..5.56 rows=301 width=8) (actual time=0.002..0.002 rows=0 loops=376)
Index Cond: (barcode_id = barcodes.id)
Planning time: 0.213 ms
Execution time: 1.152 ms
Edit
ANALYZE output for the tables:
ANALYZE VERBOSE barcodes;
INFO: analyzing "public.barcodes"
INFO: "barcodes": scanned 2760 of 2760 pages, containing 496157 live
rows and 0 dead rows; 30000 rows in sample, 496157 estimated total rows
ANALYZE
Time: 62.937 ms
ANALYZE VERBOSE locations;
INFO: analyzing "public.locations"
INFO: "locations": scanned 254 of 254 pages, containing 57394 live rows
and 0 dead rows; 30000 rows in sample, 57394 estimated total rows
ANALYZE
Time: 21.447 ms
The problem is that the barcodes with low values have no matches in locations, which PostgreSQL is not aware of. So its plan of fetching the barcodes in the correct output order via the index and then joining in the values from locations until it has found 50 of them turns out much worse than expected.
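One quick way to check this (a sketch of my own; the 10,000 cut-off is arbitrary) is to count how many of the first-sorting barcodes have no matching location. If the leading part of the index is mostly unmatched, the nested loop has to walk a long way before it collects 50 result rows:

SELECT count(*) AS unmatched
FROM (
    SELECT id
    FROM barcodes
    ORDER BY value
    LIMIT 10000  -- the first 10,000 barcodes in output order
) b
WHERE NOT EXISTS (
    SELECT 1 FROM locations WHERE locations.barcode_id = b.id
);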
I would ANALYZE the tables and
DROP INDEX index_barcodes_on_value;
That should keep PostgreSQL from choosing that plan.
I don't know which plan PostgreSQL will choose then. For a nested loop join, the following index could help:
CREATE INDEX ON locations(id);
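If dropping the index is not an option because other queries need it, a common workaround (shown here only as a sketch) is to put the join behind an optimization fence such as OFFSET 0, so the planner has to join first and sort the result afterwards; with only about 60,000 rows in locations, that sort is cheap:

SELECT *
FROM (
    SELECT locations.id, locations.barcode_id, barcodes.value
    FROM locations
    JOIN barcodes ON locations.barcode_id = barcodes.id
    OFFSET 0  -- fence: keeps the subquery from being flattened into the outer query
) AS joined
ORDER BY value ASC
LIMIT 50;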