Is it impossible for Oracle to output these queries the same - a completely different execution plan worked
I have the following two very similar subquery statements. I have highlighted the difference with **.
1
SELECT DISTINCT name
FROM person, nameindex n
WHERE person.id1 ='0812'
AND person.id2 =n.id2
AND person.id1 =n.id1
AND n.phonetic IN
(SELECT n2.phonetic
FROM nameindex n2
WHERE n2.id1=person.id1 **
GROUP BY n2.phonetic
HAVING COUNT(*) BETWEEN 4 AND 500)
2:
SELECT DISTINCT name
FROM person, nameindex n
WHERE person.id1 ='0812'
AND person.id2 =n.id2
AND person.id1 =n.id1
AND n.phonetic IN
(SELECT n2.phonetic
FROM nameindex n2
WHERE n2.id1='0812' **
GROUP BY n2.phonetic
HAVING COUNT(*) BETWEEN 4 AND 500)
I think the oracle could infer that it person.id1
should be constant 0812
in the subquery. However, both queries give extremely different execution plans and costs (1: cost 4404211855, and 2: cost: 36237). Why is this?
This is an analytical query rather than OLTP, so there are no indexes for this particular query.
(Query background: Get the names of persons within id1 = '0812' that have a phonetic entry in the index name table for which there are 4 to 500 entries.)
source to share
I followed a similar query with the following setup:
CREATE TABLE person (id1, id2, NAME) AS
SELECT to_char(mod(ROWNUM, 1000), 'fm0000'), ROWNUM,
dbms_random.string('A',10)
FROM dual
CONNECT BY LEVEL <= 1e6;
CREATE TABLE nameindex (id1, id2, phonetic) AS
SELECT id1, id2, to_char(dbms_random.value(1, 200), 'fm000')
FROM person;
I found that your first request creates the following plan:
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 291343677
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2040 | 331K (2)| 01:06:2
| 1 | HASH UNIQUE | | 1 | 2040 | 331K (2)| 01:06:2
|* 2 | FILTER | | | | |
|* 3 | HASH JOIN | | 891 | 1775K| 1750 (2)| 00:00:2
|* 4 | TABLE ACCESS FULL | NAMEINDEX | 892 | 18732 | 739 (2)| 00:00:0
|* 5 | TABLE ACCESS FULL | PERSON | 1395 | 2750K| 1010 (2)| 00:00:1
|* 6 | FILTER | | | | |
| 7 | HASH GROUP BY | | 9550 | 76400 | 740 (2)| 00:00:0
|* 8 | TABLE ACCESS FULL| NAMEINDEX | 9550 | 76400 | 739 (2)| 00:00:0
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter( EXISTS (SELECT 0 FROM "NAMEINDEX" "N2" WHERE "N2"."ID1"=:B1
GROUP BY "N2"."PHONETIC" HAVING "N2"."PHONETIC"=:B2 AND COUNT(*)>=
COUNT(*)<=500))
3 - access("PERSON"."ID2"="N"."ID2" AND "PERSON"."ID1"="N"."ID1")
4 - filter("N"."ID1"='0812')
5 - filter("PERSON"."ID1"='0812')
6 - filter("N2"."PHONETIC"=:B1 AND COUNT(*)>=4 AND COUNT(*)<=500)
8 - filter("N2"."ID1"=:B1)
As you can see that the IN semi-join is rewritten as EXISTS, which produces the same plan as this query:
SELECT DISTINCT NAME
FROM person, nameindex n
WHERE person.id1 = '0812'
AND person.id2 = n.id2
AND person.id1 = n.id1
AND EXISTS (SELECT NULL
FROM nameindex n2
WHERE n2.id1 = person.id1
AND n2.phonetic = n.phonetic
GROUP BY n2.phonetic
HAVING COUNT(*) BETWEEN 4 AND 500);
Here you can see that the subquery is NOT constant and therefore is calculated for each row of the main query, resulting in a less optimal execution plan.
I suggest you use all join columns in the GROUP BY when you are using an aggregate semi-join. The following query creates the optimal plan:
SELECT DISTINCT NAME
FROM person, nameindex n
WHERE person.id1 = '0812'
AND person.id2 = n.id2
AND person.id1 = n.id1
AND (n.id1, n.phonetic) IN (SELECT n2.id1, n2.phonetic
FROM nameindex n2
GROUP BY n2.id1, n2.phonetic
HAVING COUNT(*) BETWEEN 4 AND 500);
source to share