Improve query efficiency for duplicate queries

Question

I am writing a node.js application to enable search on a PostgreSQL database. To enable twitter-typeahead in the search box, I need to fetch a set of keywords from the database to initialize Bloodhound before the page loads. The query is something like this:

SELECT distinct handlerid from lotintro where char_length(lotid)=7;

For a large table (lotintro) this is expensive; it is also wasteful, since the query result is likely to stay the same for different web visitors over a period of time.

What is the correct way to handle this? I think there are several options:

1) Put the query in a stored procedure and call it from node.js:

   SELECT * from getallhandlerid()

Does this mean that the query will be compiled and the database will automatically return the same result set without actually executing the query, knowing that the result has not changed?

2) Or create a separate table to store the distinct handlerid values and update it with a trigger that runs every day? (I know a trigger should ideally fire on every insert / update to the table, but that is too expensive.)

3) Create a partial index as suggested. Here is what I got:

Query

SELECT distinct handlerid from lotintro where length(lotid) = 7;

Index

CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE  length(lotid) = 7;

With the index, the query takes about 250 ms:

explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=5542.64..5542.65 rows=1 width=6) (actual rows=151 loops=1)"
"  ->  Bitmap Heap Scan on lotintro  (cost=39.08..5537.50 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Recheck Cond: (length(lotid) = 7)"
"        Rows Removed by Index Recheck: 55285"
"        ->  Bitmap Index Scan on lotid7_idx  (cost=0.00..38.57 rows=2056 width=0) (actual rows=298350 loops=1)"
"Total runtime: 243.686 ms"

Without the index, the query takes about 210 ms:

explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
"  ->  Seq Scan on lotintro  (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Filter: (length(lotid) = 7)"
"        Rows Removed by Filter: 112915"
"Total runtime: 214.235 ms"

What am I doing wrong here?

4) Using the index and query suggested by alexius:

create index on lotintro using btree(char_length(lotid), handlerid);

But this is not the optimal solution. Since there are only a few distinct values, you can use a trick called loose index scan, which should be much faster in your case:

explain (analyze on, BUFFERS on, TIMING OFF)
WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1)  -- parentheses required
   UNION ALL
   SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
   FROM t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;

"CTE Scan on t  (cost=444.52..446.54 rows=100 width=32) (actual rows=151 loops=1)"
"  Filter: (handlerid IS NOT NULL)"
"  Rows Removed by Filter: 1"
"  Buffers: shared hit=608"
"  CTE t"
"    ->  Recursive Union  (cost=0.42..444.52 rows=101 width=32) (actual rows=152 loops=1)"
"          Buffers: shared hit=608"
"          ->  Limit  (cost=0.42..4.17 rows=1 width=6) (actual rows=1 loops=1)"
"                Buffers: shared hit=4"
"                ->  Index Scan using lotid_btree on lotintro lotintro_1  (cost=0.42..7704.41 rows=2056 width=6) (actual rows=1 loops=1)"
"                      Index Cond: (char_length(lotid) = 7)"
"                      Buffers: shared hit=4"
"          ->  WorkTable Scan on t t_1  (cost=0.00..43.83 rows=10 width=32) (actual rows=1 loops=152)"
"                Filter: (handlerid IS NOT NULL)"
"                Rows Removed by Filter: 0"
"                Buffers: shared hit=604"
"                SubPlan 1"
"                  ->  Limit  (cost=0.42..4.36 rows=1 width=6) (actual rows=1 loops=151)"
"                        Buffers: shared hit=604"
"                        ->  Index Scan using lotid_btree on lotintro  (cost=0.42..2698.13 rows=685 width=6) (actual rows=1 loops=151)"
"                              Index Cond: ((char_length(lotid) = 7) AND (handlerid > t_1.handlerid))"
"                              Buffers: shared hit=604"
"Planning time: 1.574 ms"
"Execution time: 25.476 ms"

========= More information about db =====================================

dataloggerDB=# \d lotintro
            Table "public.lotintro"

    Column    |            Type             |  Modifiers
 --------------+-----------------------------+--------------
  lotstartdt   | timestamp without time zone | not null
  lotid        | text                        | not null
  ftc          | text                        | not null
  deviceid     | text                        | not null
  packageid    | text                        | not null
  testprogname | text                        | not null
  testprogdir  | text                        | not null
  testgrade    | text                        | not null
  testgroup    | text                        | not null
  temperature  | smallint                    | not null
  testerid     | text                        | not null
  handlerid    | text                        | not null
  numofsite    | text                        | not null
  masknum      | text                        |
  soaktime     | text                        |
  xamsqty      | smallint                    |
  scd          | text                        |
  speedgrade   | text                        |
  loginid      | text                        |
  operatorid   | text                        | not null
  loadboardid  | text                        | not null
  checksum     | text                        |
  lotenddt     | timestamp without time zone | not null
  totaltest    | integer                     | default (-1)
  totalpass    | integer                     | default (-1)
  earnhour     | real                        | default 0
  avetesttime  | real                        | default 0
  Indexes:
  "pkey_lotintro" PRIMARY KEY, btree (lotstartdt, testerid)
  "lotid7_idx" btree (handlerid) WHERE length(lotid) = 7

your version of Postgres,         [PostgreSQL 9.2]
cardinalities (how many rows?),   [411K rows for table lotintro]
percentage for length(lotid) = 7. [298350/411000=  73%]

============== after transferring everything to PG 9.4 =====================

With index:

explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=5542.78..5542.79 rows=1 width=6) (actual rows=151 loops=1)"
"  Group Key: handlerid"
"  Buffers: shared hit=14242"
"  ->  Bitmap Heap Scan on lotintro  (cost=39.22..5537.64 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Recheck Cond: (length(lotid) = 7)"
"        Heap Blocks: exact=13313"
"        Buffers: shared hit=14242"
"        ->  Bitmap Index Scan on lotid7_idx  (cost=0.00..38.70 rows=2056 width=0) (actual rows=298350 loops=1)"
"              Buffers: shared hit=929"
"Planning time: 0.256 ms"
"Execution time: 154.657 ms"

Without index:

explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
"  Group Key: handlerid"
"  Buffers: shared hit=13316"
"  ->  Seq Scan on lotintro  (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Filter: (length(lotid) = 7)"
"        Rows Removed by Filter: 112915"
"        Buffers: shared hit=13316"
"Planning time: 0.168 ms"
"Execution time: 176.466 ms"

triggers stored-procedures postgresql twitter-typeahead bloodhound

sqr May 09 '15 at 9:22

3 answers

You need to index the exact expression used in your WHERE clause: http://www.postgresql.org/docs/9.4/static/indexes-expressional.html

CREATE INDEX char_length_lotid_idx ON lotintro (char_length(lotid));

You can also create a STABLE or IMMUTABLE function to wrap this query, as you suggested: http://www.postgresql.org/docs/9.4/static/sql-createfunction.html

Your last suggestion is also viable; what you are looking for is MATERIALIZED VIEWS: http://www.postgresql.org/docs/9.4/static/sql-creatematerializedview.html This saves you from writing a custom trigger to update the denormalized table.

Clément Prévost May 09 '15 at 11:38

Since 3/4 of the rows satisfy your condition (length(lotid) = 7), the index alone won't help much. You can get slightly better performance with this index because it allows an index-only scan:

create index on lotintro using btree(char_length(lotid), handlerid);

But this is not the optimal solution. Since there are only a few distinct values, you can use a trick called loose index scan, which should be much faster in your case:

WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1)  -- parentheses required
   UNION ALL
   SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
   FROM t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;

For this query you also need to create the index mentioned above.

alexius May 11 '15 at 4:26

Erwin Brandstetter · Accepted Answer · 2015-05-09T14:13:12+0000

1)

No, a function does not save snapshots of the result in any way. There is some potential for performance optimization if you define the function STABLE (which would be correct). Per the documentation:

A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement.

IMMUTABLE would be wrong and could potentially cause errors.

So this can be very beneficial for multiple calls within the same statement, but that does not fit your use case ...

And plpgsql functions work like prepared statements, giving you a similar bonus within a single session:

Difference between sql language and plpgsql language in PostgreSQL functions
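
As a rough sketch of option 1: the function name getallhandlerid() comes from the question, while the body and return type below are assumptions based on the query and table definition shown there. Note that this still executes the query on every call; STABLE only describes the function's behavior to the planner.

```sql
-- Assumed body: the DISTINCT query from the question, wrapped unchanged.
-- handlerid is declared as text in lotintro, hence SETOF text.
CREATE OR REPLACE FUNCTION getallhandlerid()
  RETURNS SETOF text
  LANGUAGE sql STABLE AS
$func$
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  length(lotid) = 7;
$func$;

-- Called from node.js exactly as in the question:
SELECT * FROM getallhandlerid();
```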

2)

Try a MATERIALIZED VIEW. With or without an MV (or some other caching method), a partial index will be most efficient for your special case:

CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE  length(lotid) = 7;

Remember to include the index condition in queries that are supposed to use the index, even if that seems redundant:

PostgreSQL does not use a partial index
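
To illustrate with the partial index lotid7_idx above: as far as I know, the planner matches the index predicate by expression and does not treat length() and char_length() as interchangeable, even though both return the same value for text. A sketch of the two cases:

```sql
-- Repeats the index predicate verbatim, so lotid7_idx is a candidate:
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  length(lotid) = 7;

-- Same result for text, but a different function call to the planner,
-- so the partial index defined with length(lotid) = 7 is not considered:
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  char_length(lotid) = 7;
```

Running both under EXPLAIN shows which plan is chosen on your data.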

However, as you pointed out:

percentage for length(lotid) = 7. [298350/411000= 73%]

This index will only help if you can get an index-only scan out of it, because the condition is barely selective. Since the table has very wide rows, index-only scans can be substantially faster.

Loose index scan

Also, rows=298350 are folded to rows=151, so a loose index scan will pay off, as I explained here:

Optimize GROUP BY query to get last record per user

Or in the Postgres Wiki, which is actually based on this post.

WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro
    WHERE  length(lotid) = 7
    ORDER  BY 1 LIMIT 1)

   UNION ALL
   SELECT (SELECT handlerid FROM lotintro
           WHERE  length(lotid) = 7
           AND    handlerid > t.handlerid
           ORDER  BY 1 LIMIT 1)
   FROM  t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t
WHERE  handlerid IS NOT NULL;

This will be faster still in combination with the partial index I suggested. Since the partial index is only half the size and is updated less frequently (depending on access patterns), it is cheaper overall.

Faster still if you keep the table vacuumed to allow index-only scans. You can set more aggressive storage parameters for just this table if it gets many writes:

PostgreSQL database initial size
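
A minimal sketch of what that could look like. The scale factors are illustrative assumptions, not recommendations; suitable values depend on the write load.

```sql
-- One-off: refresh the visibility map and statistics so index-only scans
-- can skip most heap fetches.
VACUUM (ANALYZE) lotintro;

-- Per-table autovacuum settings (values are assumed for illustration).
ALTER TABLE lotintro SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.02
);
```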

Finally, you can make it faster still with a materialized view built on this query.
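
A sketch of that last step, assuming PostgreSQL 9.4 (REFRESH ... CONCURRENTLY is not available in 9.3) and hypothetical object names handlerid_mv and handlerid_mv_handlerid_idx. The plain DISTINCT query is used here for brevity; the recursive query above could be substituted as the view definition.

```sql
-- Hypothetical view name; stores only the distinct values (151 rows in the plans above).
CREATE MATERIALIZED VIEW handlerid_mv AS
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  length(lotid) = 7;

-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
CREATE UNIQUE INDEX handlerid_mv_handlerid_idx ON handlerid_mv (handlerid);

-- Run periodically (for example once a day); readers are not blocked meanwhile.
REFRESH MATERIALIZED VIEW CONCURRENTLY handlerid_mv;

-- What the node.js application would run on each page load:
SELECT handlerid FROM handlerid_mv ORDER BY handlerid;
```

The refresh interval decides how stale the keyword list may get, which matches the daily update considered in option 2 of the question.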
