Why does Spark explain these two queries differently?

So, I have these two queries to achieve the same goal. Using Spark-SQL.

Request A:

SELECT * FROM inspex.defect_parquet a
INNER JOIN inspex.layer_parquet b
ON a.id = b.id 
AND b.name = 'Example1';

      

Request B:

SELECT * FROM inspex.defect_parquet 
WHERE inspex.layer_scan_index  
IN    (SELECT layer_scan_index
      FROM inspex.layer_parquet
      WHERE name = 'Example1');

      

defect_parquet

is a rather large table, but layer_parquet

is a small table with several hundred kilobytes.

Request B is 80% faster than request A. And when I see an explanation of how Spark runs this. Here for Query A: enter image description here Here for Query B:

It looks like Spark handles them differently. Can anyone explain this to me? And why is the request faster?

+3


source to share


1 answer


I think the statistics tell everything:



  • both versions use a broadcast connection
  • however, in the second query you are creating the project in a subquery, so the output table is much smaller - this results in a significantly smaller broadcast size and reduced time
  • the first query tries to prefilter a large dataset, however without many changes - the dataset is still large, so this optimization only slows down your number one query.
+1


source







All Articles