Why isn't this query using the index?
I ran into strange behavior from the Postgres optimizer on the following query:
select count(product0_.id) as col_0_0_ from Product product0_
where product0_.active=true
and (product0_.aggregatorId is null
or product0_.aggregatorId in ($1 , $2 , $3))
Product has about 54 columns; active is a boolean with a btree index, and aggregatorId is varchar(15) and also has a btree index.
In the query above, the index on aggregatorId is not used:
Aggregate (cost=169995.75..169995.76 rows=1 width=32) (actual time=3904.726..3904.727 rows=1 loops=1)
-> Seq Scan on product product0_ (cost=0.00..165510.39 rows=1794146 width=32) (actual time=0.055..2407.195 rows=1851827 loops=1)
Filter: (active AND ((aggregatorid IS NULL) OR ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))))
Rows Removed by Filter: 542146
Total runtime: 3904.925 ms
But if we reduce the query by removing the null check for that column, the index is used:
Aggregate (cost=17600.93..17600.94 rows=1 width=32) (actual time=614.933..614.935 rows=1 loops=1)
-> Index Scan using idx_prod_aggr on product product0_ (cost=0.43..17487.56 rows=45347 width=32) (actual time=19.284..594.509 rows=12099 loops=1)
Index Cond: ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))
Filter: active
Rows Removed by Filter: 49130
Total runtime: 150.255 ms
As far as I know, a btree index can handle null checks, so I don't understand why the index is not used for the full query. The product table contains about 2.3 million rows, so the query is not very fast.
EDIT: The index is very standard:
CREATE INDEX idx_prod_aggr
ON product
USING btree
(aggregatorid COLLATE pg_catalog."default");
Since the conditions in your where clause match so many rows (roughly 77% of all table rows, according to your numbers), the database concludes that a full table scan is cheaper than spending extra time reading the index.
The rule of thumb for most database vendors is that an index will probably not be used unless it can narrow the search down to about 5% of all records in the table.
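To make the proportions concrete, the row counts in the two plans from the question can be turned into fractions of the table. This is only a back-of-the-envelope sketch in Python: the variable names are made up for illustration, and the planner itself decides from cost estimates rather than from a fixed percentage.

```python
# Row counts copied from the EXPLAIN ANALYZE output in the question.
full_matched = 1851827   # rows kept by the seq scan filter (full query)
full_removed = 542146    # rows removed by the seq scan filter
table_rows = full_matched + full_removed  # every row the seq scan visited

# Rows the index scan fetched for the IN list alone
# (rows kept plus rows later dropped by the "active" filter).
in_kept = 12099
in_removed = 49130
in_fetched = in_kept + in_removed

full_fraction = full_matched / table_rows
in_fraction = in_fetched / table_rows

print(f"table rows:       {table_rows}")
print(f"full query keeps: {full_fraction:.1%}")
print(f"IN list touches:  {in_fraction:.1%}")
```

By these numbers the full predicate keeps roughly three quarters of the table, while the IN list on its own reaches under 3% of it. That matches the switch from a sequential scan to an index scan.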
Your problem looked interesting, so I reproduced your scenario - postgres 9.1, table with 1M rows, one boolean column, one varchar column, indexed, half of the table has NULL names.
I got the same explain analyze output as you when the varchar column was not indexed. With the index in place, however, Postgres runs one bitmap index scan for the NULL condition and another for the IN clause, then combines them with a BitmapOr.
It then applies the boolean condition as a filter on the fetched rows (since the two indexes are separate).
explain analyze
select * from A where active is true and ((name is null) OR (name in ('1','2','3') ));
See the output:
"Bitmap Heap Scan on a (cost=17.34..21.35 rows=1 width=18) (actual time=0.048..0.048 rows=0 loops=1)"
" Recheck Cond: ((name IS NULL) OR ((name)::text = ANY ('{1,2,3}'::text[])))"
" Filter: (active IS TRUE)"
" -> BitmapOr (cost=17.34..17.34 rows=1 width=0) (actual time=0.047..0.047 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..4.41 rows=1 width=0) (actual time=0.010..0.010 rows=0 loops=1)"
" Index Cond: (name IS NULL)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..12.93 rows=1 width=0) (actual time=0.036..0.036 rows=0 loops=1)"
" Index Cond: ((name)::text = ANY ('{1,2,3}'::text[]))"
"Total runtime: 0.077 ms"
This makes me think that you left out some details; if so, please add them to your question.
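As a footnote on the plan shape above, the steps (two index lookups, an OR of the results, then a filter on the boolean) can be mimicked with a toy model. This is a simplified Python sketch with invented data; real Postgres bitmaps track heap pages and tuples rather than Python sets, so treat it as an illustration of the logic only.

```python
# Toy model of the plan shape: two index lookups, OR-ed together, then a filter.
# Each row is (row_id, active, name); the data is invented for illustration.
rows = [
    (0, True, None),
    (1, False, None),
    (2, True, "1"),
    (3, True, "7"),
    (4, False, "2"),
    (5, True, "3"),
]

# Stand-in for the btree index on name: value -> set of row ids.
index = {}
for row_id, _active, name in rows:
    index.setdefault(name, set()).add(row_id)

# "Bitmap Index Scan" with Index Cond: name IS NULL.
null_bitmap = index.get(None, set())

# "Bitmap Index Scan" with Index Cond: name = ANY ('{1,2,3}').
in_bitmap = set()
for value in ("1", "2", "3"):
    in_bitmap |= index.get(value, set())

# "BitmapOr": union of both sets of row ids.
candidates = null_bitmap | in_bitmap

# "Bitmap Heap Scan" with Filter: fetch candidate rows, keep the active ones.
result = sorted(row_id for row_id in candidates if rows[row_id][1])
print(result)
```

The point of the model is that the boolean column is never looked up through an index here: it is checked only on rows that the two name lookups have already selected.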