Postgres similarity query optimization (pg_trgm + GIN index)

I have defined the following index:

CREATE INDEX
    users_search_idx
ON
    auth_user
USING
    gin(
        username gin_trgm_ops,
        first_name gin_trgm_ops,
        last_name gin_trgm_ops
    );
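
(For reference, the gin_trgm_ops operator class comes from the pg_trgm extension, which is assumed to be installed:)

CREATE EXTENSION IF NOT EXISTS pg_trgm;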


I am running the following query:

PREPARE user_search (TEXT, INT) AS
    SELECT
        username,
        email,
        first_name,
        last_name,
        ( -- would probably do per-field weightings here
            s_username + s_first_name + s_last_name
        ) rank
    FROM
        auth_user,
        similarity(username, $1) s_username,
        similarity(first_name, $1) s_first_name,
        similarity(last_name, $1) s_last_name
    WHERE
        username % $1 OR
        first_name % $1 OR
        last_name % $1
    ORDER BY
        rank DESC
    LIMIT $2;


The table auth_user has 6.2 million rows.

Query speed seems to be highly dependent on the number of rows that potentially match the similarity condition. Increasing the similarity threshold with set_limit helps, but it reduces the usefulness of the results by filtering out looser matches.
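
For reference, a sketch of inspecting and adjusting that threshold (set_limit/show_limit are pg_trgm functions; Postgres 9.6 also exposes it as the pg_trgm.similarity_threshold setting):

SELECT show_limit();                        -- current threshold for the % operator (default 0.3)
SELECT set_limit(0.5);                      -- stricter threshold: fewer, closer matches
SET pg_trgm.similarity_threshold = 0.5;     -- equivalent setting in 9.6+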

Some searches return in ~200ms, others take ~10 seconds.

We have an existing implementation of this search using Elasticsearch that returns in <200ms for any query, while doing more complex (better) ranking.

Is there a way to improve this in order to get more consistent performance?

As I understand it, a GIN (inverted) index is the same basic approach used by Elasticsearch, so I would expect some optimization to be possible.

EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:

Limit  (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
  ->  Sort  (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
        Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Nested Loop  (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
              ->  Nested Loop  (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
                    ->  Nested Loop  (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
                          ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
                                Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
                                Rows Removed by Index Recheck: 2434523
                                Heap Blocks: exact=49337 lossy=53104
                                ->  BitmapOr  (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200 rows=88908 loops=1)
                                            Index Cond: ((username)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847 rows=102028 loops=1)
                                            Index Cond: ((first_name)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387 rows=58740 loops=1)
                                            Index Cond: ((last_name)::text % 'mel'::text)
                          ->  Function Scan on similarity s_username  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
                    ->  Function Scan on similarity s_first_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
              ->  Function Scan on similarity s_last_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms


The Postgres 9.6.1 server runs on Amazon RDS.

Updates

1.

Shortly after posting the question, I found this information: https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com

So I tried:

-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)


This made a big improvement (previously >10s)!

1.5s is still much slower than ES for a similar query, so I would still love to hear any suggestions for optimizing the query.
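
If raising work_mem turns out to be the fix, it does not have to be set per session; a sketch of persisting it at the role level (the role name here is a placeholder; the same can be done per database or in the RDS parameter group):

ALTER ROLE app_user SET work_mem = '12MB';  -- app_user is a hypothetical application role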

2.

In response to comments, and after looking at this question (PostgreSQL GIN index is slower than GiST for pg_trgm), I tried exactly the same setup with a GiST index instead of a GIN index.
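
That is, the same three columns indexed with gist_trgm_ops in place of gin_trgm_ops, along these lines (sketch):

-- Sketch of the GiST counterpart of the original index
CREATE INDEX users_search_gist_idx ON auth_user
USING gist (username gist_trgm_ops, first_name gist_trgm_ops, last_name gist_trgm_ops);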

Trying the same search as above, it came back in ~3.5s using the default work_mem='4MB'. Increasing work_mem made no difference.

From this I conclude that the GiST index is more memory efficient (it did not hit the pathological case like GIN did), but it is slower than GIN when GIN works as expected. This is in line with the documentation, which recommends the GIN index.

3.

I still don't understand why this step takes so long:

 ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
     Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
     Rows Removed by Index Recheck: 2434523
     Heap Blocks: exact=49337 lossy=53104


I don't understand why this step is needed or what it does. Beneath it there are three Bitmap Index Scans, one for each of the username/first_name/last_name % $1 clauses; their results are then combined in a BitmapOr step. Those parts are all quite fast. But even when we don't run out of work_mem, we still spend nearly a whole second in the Bitmap Heap Scan.
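
One way to look more closely at that step (a sketch; BUFFERS is a standard EXPLAIN option) is to check how much I/O the heap scan does and whether the bitmap still goes lossy:

-- Sketch: buffer statistics for the same prepared query
EXPLAIN (ANALYZE, BUFFERS) EXECUTE user_search('mel', 20);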


1 answer


I expect much faster results with this approach:

1.

Create a GiST index with 1 column containing the concatenated values:

CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);


This assumes all 3 columns are defined NOT NULL (you did not specify). Otherwise you need to do more.
Why not simplify with concat_ws()?
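
If the columns can be NULL, one option (my sketch, not part of the original answer) is to fold coalesce() into the indexed expression; concat_ws() itself is only STABLE, not IMMUTABLE, so it cannot be used directly in an index expression:

-- Sketch for nullable columns: coalesce() keeps the expression IMMUTABLE,
-- which CREATE INDEX requires. The WHERE / ORDER BY expressions below would
-- then have to use the exact same coalesce() form.
CREATE INDEX users_search_idx ON auth_user
USING gist ((coalesce(username, '') || ' ' || coalesce(first_name, '') || ' ' || coalesce(last_name, '')) gist_trgm_ops);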

2.

Use a query that matches the index above:

SELECT username, email, first_name, last_name
     , similarity(username  , $1) AS s_username
     , similarity(first_name, $1) AS s_first_name
     , similarity(last_name , $1) AS s_last_name
     , row_number() OVER () AS rank  -- greatest similarity first
FROM   auth_user
WHERE     (username || ' ' || first_name || ' ' || last_name) %   $1  -- !!
ORDER  BY (username || ' ' || first_name || ' ' || last_name) <-> $1  -- !!
LIMIT  $2;




Expressions in WHERE and ORDER BY must match the indexed expression!

In particular, ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picked from a much larger pool of qualifying rows, because it cannot use the index directly: the sophisticated expression behind rank has to be calculated for every qualifying row, and then all of them have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query, which can pick the best results straight from the index without even looking at the rest.
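
For reference (my note, not part of the original answer): <-> is the pg_trgm distance operator, defined as one minus similarity(), so ordering by ascending distance is the same as ordering by descending similarity, and a gist_trgm_ops index can serve it as a nearest-neighbor scan. A quick sanity check:

-- The trigram distance operator is 1 - similarity(),
-- so ORDER BY expr <-> $1 returns the closest matches first.
SELECT similarity('mel', 'melissa')      AS sim
     , 'mel' <-> 'melissa'               AS dist
     , 1 - ('mel' <-> 'melissa')         AS sim_check;  -- equals sim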

row_number() with an empty window definition simply reflects the ordering produced by the ORDER BY of the same SELECT.



As for your point 3., I have added an answer to the question you linked to, which should explain it.
