Search 300 million addresses with pg_trgm
I have 300 million addresses in my PostgreSQL 9.3 DB and I want to use pg_trgm to fuzzy search strings. The ultimate goal is to implement a search function similar to Google Maps search.
When I used pg_trgm to search these addresses, it took about 30 seconds to get results. Many rows pass the default similarity threshold of 0.3, but I only need 5 or 10 results. I created a GiST trigram index:
CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);
This is my query:
SELECT address, similarity(address, '981 maun st') AS sml
FROM addresses
WHERE address % '981 maun st'
ORDER BY sml DESC
LIMIT 10;
The test table in the production environment has since been deleted, so I am showing the EXPLAIN output from my test environment. It has about 7 million rows, and the query takes about 1.6 seconds to return results. With 300 million rows, it takes more than 30 seconds.
ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
QUERY PLAN
────────────────────────────────────────────────────────────────────────────────
 Limit (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
   -> Sort (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
         Sort Key: (similarity((address)::text, '781 maun st'::text))
         Sort Method: top-N heapsort Memory: 25kB
         -> Index Scan using addresses_trgm_index on addresses (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
               Index Cond: ((address)::text % '781 maun st'::text)
Total runtime: 1661.066 ms
(7 rows)
Is there a good way to improve performance, or is partitioning the table a good plan?
PostgreSQL 9.3 ... Is there a good way to improve performance, or is partitioning the table a good plan?
Partitioning the table won't help at all.
But yes, there is a good way: upgrade to a current version of Postgres. Many improvements have been made to GiST indexes, to the pg_trgm module in particular, and to handling big data in general. It should be substantially faster with Postgres 9.6 or the upcoming Postgres 10 (currently in beta).
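As a small sketch of how one might confirm what is actually running before and after such an upgrade: the statements below only use the standard pg_extension catalog and ALTER EXTENSION, and they assume pg_trgm is installed in the current database. Whether the extension needs updating depends on how the upgrade was done.

```sql
-- Which server version is this session connected to?
SELECT version();

-- Which version of the pg_trgm extension is installed in this database?
SELECT extname, extversion
FROM   pg_extension
WHERE  extname = 'pg_trgm';

-- After a major-version upgrade, the extension's SQL objects can still be
-- at the old extension version. This moves them to the version shipped
-- with the new server.
ALTER EXTENSION pg_trgm UPDATE;
```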
Your "nearest neighbor" query looks correct, but for a small LIMIT, use this equivalent query instead:
SELECT address, similarity(address, '981 maun st') AS sml
FROM addresses
WHERE address % '981 maun st'
ORDER BY address <-> '981 maun st'
LIMIT 10;
This will typically beat the first formulation when only a small number of closest matches is wanted.
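The likely reason is that a GiST trigram index can return rows in order of the <-> distance, so the scan can stop after the first 10 matches instead of fetching every row that passes the threshold (5241 rows in the test plan above) and sorting them. A sketch of how one might verify this, assuming the same addresses table and index as in the question; the threshold value 0.5 is an arbitrary illustration, not a recommendation:

```sql
-- Compare with the earlier plan. If the index delivers rows in distance
-- order, there should be no separate Sort node, and the index scan should
-- show an "Order By" line next to the "Index Cond" line.
EXPLAIN (ANALYZE, BUFFERS)
SELECT address, similarity(address, '981 maun st') AS sml
FROM   addresses
WHERE  address % '981 maun st'
ORDER  BY address <-> '981 maun st'
LIMIT  10;

-- Optional: raise the similarity threshold used by the % operator for the
-- current session, so that fewer rows qualify in the first place.
-- (In Postgres 9.6 or later, the setting pg_trgm.similarity_threshold
-- serves the same purpose.)
SELECT set_limit(0.5);
SELECT show_limit();
```

A higher threshold trades recall for speed, so a badly misspelled address may no longer match.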