Search 300 million addresses with pg_trgm

Question


I have 300 million addresses in my PostgreSQL 9.3 database and I want to use pg_trgm for fuzzy string search. The ultimate goal is to implement a search function similar to Google Maps search.

When I used pg_trgm to search these addresses, it took about 30 seconds to get results. Many rows pass the default similarity threshold of 0.3, but I only need 5 or 10 results. I created a GiST trigram index:

CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);

This is my query:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM addresses 
WHERE address % '981 maun st' 
ORDER BY sml DESC 
LIMIT 10;

The test table in the production environment has been deleted, so I am showing the EXPLAIN output from my test environment. It has about 7 million rows and takes about 1.6 seconds to return results. With 300 million rows, it takes more than 30 seconds.

ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
                                    QUERY PLAN                                                                            
--------------------------------------------------------------------------------
 Limit  (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
 ->  Sort  (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
     Sort Key: (similarity((address)::text, '781 maun st'::text))
     Sort Method: top-N heapsort  Memory: 25kB
     ->  Index Scan using addresses_trgm_index on addresses  (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
           Index Cond: ((address)::text % '781 maun st'::text)
 Total runtime: 1661.066 ms
(7 rows)

Is there a good way to improve performance, or would partitioning the table be a good plan?


pattern-matching postgresql nearest-neighbor bigdata pg-trgm

Gary tao, June 27, 2017 at 6:16 am


1 answer

Erwin Brandstetter · Answer 1 · 2017-06-30T03:36:33+0000

PostgreSQL 9.3 ... Is there a good way to improve performance, or would partitioning the table be a good plan?

Partitioning the table won't help at all.

But yes, there is a good way: upgrade to a current Postgres version. Many improvements have been made to GiST indexes, to pg_trgm in particular, and to big data handling in general. The query should be significantly faster with Postgres 9.6 or the upcoming Postgres 10 (currently in beta).

Your query looks correct, but for a small LIMIT, use this equivalent "nearest neighbor" query instead:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM   addresses 
WHERE  address % '981 maun st' 
ORDER  BY address <-> '981 maun st'
LIMIT  10;

Quote from the manual:

It will usually beat the first formulation when only a small number of the closest matches is wanted.
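
To check whether the rewritten query actually avoids the sort step, something like the following sketch may help. It assumes the addresses table and the addresses_trgm_index GiST index from the question; the literal '981 main st' is only an illustrative value, not data from the original post. The first statement shows that <-> is the trigram distance (1 minus similarity), which is why ascending distance gives the same order as descending similarity.

```sql
-- Sketch only: assumes the table and GiST index from the question exist.

-- <-> returns the trigram distance, i.e. 1 - similarity().
-- The explicit casts are needed because <-> is also defined for other types.
SELECT similarity('981 main st', '981 maun st')     AS sml,
       '981 main st'::text <-> '981 maun st'::text  AS dist;

-- Inspect the plan of the nearest-neighbor variant.
-- If the index is used for ordering, the plan should show an Index Scan
-- with an "Order By" line and no separate Sort node.
EXPLAIN ANALYZE
SELECT address, similarity(address, '981 maun st') AS sml
FROM   addresses
WHERE  address % '981 maun st'
ORDER  BY address <-> '981 maun st'
LIMIT  10;
```

If the plan still contains a Sort node, the index is probably not being used for ordering, and the query will behave like the original one.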
