Why am I getting "Hash Join" and FTS in this PostgreSQL query?

I am trying to optimize the following scenario:

In text format: I have 2 tables, alerts

and user_devices

; c user_devices

we keep track of whether the device associated with user_id

receiving notifications or not, and in the table alerts

we track the user's attitude to the notification. Basically, the challenge is to select each user_id

one that has any alerts and allows you to notify about any of the devices registered on it.

"Alerts" table, about 900 thousand records:

               Table "public.alerts"
   Column    |           Type           | Modifiers 
-------------+--------------------------+-----------
 id          | uuid                     | not null
 user_id     | uuid                     | 
 target_id   | uuid                     | 
 target_type | text                     | 
 added_on    | timestamp with time zone | 
 old_id      | text                     | 
Indexes:
    "alerts_pkey" PRIMARY KEY, btree (id)
    "one_alert_per_business_per_user" UNIQUE CONSTRAINT, btree (user_id, target_id)
    "addedon" btree (added_on)
    "targetid" btree (target_id)
    "userid" btree (user_id)
    "userid_targetid" btree (user_id, target_id)
Foreign-key constraints:
    "alerts_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)

      

Table 'user_devices', about 12 thousand records:

                Table "public.user_devices"
       Column        |           Type           | Modifiers 
---------------------+--------------------------+-----------
 id                  | uuid                     | not null
 user_id             | uuid                     | 
 device_id           | text                     | 
 device_token        | text                     | 
 push_notify_enabled | boolean                  | 
 device_type         | integer                  | 
 device_name         | text                     | 
 badge_count         | integer                  | 
 added_on            | timestamp with time zone | 
Indexes:
    "user_devices_pkey" PRIMARY KEY, btree (id)
    "push_notification" btree (push_notify_enabled)
    "user_id" btree (user_id)
    "user_id_push_notification" btree (user_id, push_notify_enabled)
Foreign-key constraints:
    "user_devices_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)

      

Next query:

select COUNT(DISTINCT a.user_id) 
from alerts a 
  inner join user_devices ud on a.user_id = ud.user_id 
WHERE ud.push_notify_enabled = true;

      

It takes about 3 seconds and gives the following plan:

explain select COUNT(DISTINCT a.user_id) from alerts a inner join user_devices ud on a.user_id = ud.user_id WHERE ud.push_notify_enabled = true;
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Aggregate  (cost=49777.32..49777.33 rows=1 width=16)
   ->  Hash Join  (cost=34508.97..48239.63 rows=615074 width=16)
         Hash Cond: (ud.user_id = a.user_id)
         ->  Seq Scan on user_devices ud  (cost=0.00..480.75 rows=9202 width=16)
               Filter: push_notify_enabled
         ->  Hash  (cost=20572.32..20572.32 rows=801732 width=16)
               ->  Seq Scan on alerts a  (cost=0.00..20572.32 rows=801732 width=16)

      

What I am missing, is there a way to speed this up?

Thank.

== edit ==

As suggested, tried to move the condition inside the join, no difference:

=> explain select COUNT(DISTINCT a.user_id) from alerts a inner join user_devices ud on a.user_id = ud.user_id and ud.push_notify_enabled;
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Aggregate  (cost=49777.32..49777.33 rows=1 width=16)
   ->  Hash Join  (cost=34508.97..48239.63 rows=615074 width=16)
         Hash Cond: (ud.user_id = a.user_id)
         ->  Seq Scan on user_devices ud  (cost=0.00..480.75 rows=9202 width=16)
               Filter: push_notify_enabled
         ->  Hash  (cost=20572.32..20572.32 rows=801732 width=16)
               ->  Seq Scan on alerts a  (cost=0.00..20572.32 rows=801732 width=16)

      

So there is no way to get away from 2 FTS? If I could even get him to somehow use the index on the alert table, that would be great ..

== edit ==

Adding `EXPLAIN ANALYZE.

=> explain ANALYZE select COUNT(DISTINCT a.user_id) from alerts a inner join user_devices ud on a.user_id = ud.user_id and ud.push_notify_enabled;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=49777.32..49777.33 rows=1 width=16) (actual time=5254.355..5254.356 rows=1 loops=1)
   ->  Hash Join  (cost=34508.97..48239.63 rows=615074 width=16) (actual time=1824.607..2863.635 rows=614768 loops=1)
         Hash Cond: (ud.user_id = a.user_id)
         ->  Seq Scan on user_devices ud  (cost=0.00..480.75 rows=9202 width=16) (actual time=0.048..16.784 rows=9186 loops=1)
               Filter: push_notify_enabled
         ->  Hash  (cost=20572.32..20572.32 rows=801732 width=16) (actual time=1824.229..1824.229 rows=801765 loops=1)
               Buckets: 4096  Batches: 32  Memory Usage: 990kB
               ->  Seq Scan on alerts a  (cost=0.00..20572.32 rows=801732 width=16) (actual time=0.047..878.429 rows=801765 loops=1)
 Total runtime: 5255.427 ms
(9 rows)

      

=== Edit ===

Adding the requested configuration. Most of them are Ubuntu PG9.1 by default:

/etc/postgresql/9.1/main# cat postgresql.conf | grep -e "work_mem" -e "effective_cache" -e "shared_buff" -e "random_page_c"
shared_buffers = 24MB           # min 128kB
#work_mem = 1MB             # min 64kB
#maintenance_work_mem = 16MB        # min 1MB
#wal_buffers = -1           # min 32kB, -1 sets based on shared_buffers
#random_page_cost = 4.0         # same scale as above
#effective_cache_size = 128MB

      

+3


source to share


2 answers


Replacing an index with a partial index:

DROP INDEX    user_id_push_notification ;
CREATE INDEX    user_id_push_notification ON user_devices (user_id)
 WHERE push_notify_enabled =True
 ;

      

and set the random_page_cost to a lower value:

SET random_page_cost = 1.1;



Called a Index Scan using push_notification on user_devices ud

for me (<300ms). YMMV.

Seqscan on alert seems more or less inevitable as you would expect 800K / 900K: = 88% lines). An index scan would only be effective if the rowizes were very large IMHO.

UPDATE: Adding the users table to the query seems to result in a triple index scan. (but around the same time)

explain  ANALYZE
select COUNT(DISTINCT a.user_id)
from alerts a
join user_devices ud on a.user_id = ud.user_id
join users us on a.user_id = us.id
WHERE ud.push_notify_enabled = true;

      

+1


source


As said in the comments, the real pig is a full table scan alerts

. Logically, for a given user ID, any and all records in alerts

can match this user ID.

Do you have another condition that can restrict the scan: push_notify_enabled

; you don't need the lines where it is false

. But you are missing an index on that column, so a full scan on alerts

is still the fastest way to join two tables.

Try using the bitmap index on push_notify_enabled

if your Postgres version supports it. (Obviously, a btree index on a 2-valued column is not good.)

To speed up your query, you need to limit the number of rows that need to be scanned into alerts

, that is, add a condition to some indexed column alerts

. It is then possible to scan the index instead of a full scan if the index is selective enough.



For example, you can filter by target id or date-related column if that makes sense.

If you have 900k alerts that are active and can be arbitrarily assigned to users, you have little choice; perhaps adding RAM to keep the table alerts

always cached might help. (Adding hardware is often the simplest and most economical solution.)

AFAICT you are only interested in push notification related alerts. If users with push notifications never exchange notifications with users without push notifications, you can effectively split alerts

on this condition.

If you have bitmap indexes, you can move the column push_notify_enabled

to alerts

. Alternatively, you can try to physically partition it into this column using partitioning . If the number of alerts with push notifications is significantly lower than the general alerts, a much smaller fraction will be scanned for its connection alerts

.

+1


source







All Articles