Why does Spark explain these two queries differently?
So, I have these two queries to achieve the same goal. Using Spark-SQL.
Request A:
SELECT * FROM inspex.defect_parquet a
INNER JOIN inspex.layer_parquet b
ON a.id = b.id
AND b.name = 'Example1';
Request B:
SELECT * FROM inspex.defect_parquet
WHERE inspex.layer_scan_index
IN (SELECT layer_scan_index
FROM inspex.layer_parquet
WHERE name = 'Example1');
defect_parquet
is a rather large table, but layer_parquet
is a small table with several hundred kilobytes.
Request B is 80% faster than request A. And when I see an explanation of how Spark runs this. Here for Query A: Here for Query B:
It looks like Spark handles them differently. Can anyone explain this to me? And why is the request faster?
+3
source to share
1 answer
I think the statistics tell everything:
- both versions use a broadcast connection
- however, in the second query you are creating the project in a subquery, so the output table is much smaller - this results in a significantly smaller broadcast size and reduced time
- the first query tries to prefilter a large dataset, however without many changes - the dataset is still large, so this optimization only slows down your number one query.
+1
source to share