Why does Spark explain these two queries differently?

Question

I have two Spark SQL queries that are meant to achieve the same goal.

Query A:

SELECT * FROM inspex.defect_parquet a
INNER JOIN inspex.layer_parquet b
ON a.id = b.id 
AND b.name = 'Example1';

Query B:

SELECT * FROM inspex.defect_parquet 
WHERE layer_scan_index
IN    (SELECT layer_scan_index
      FROM inspex.layer_parquet
      WHERE name = 'Example1');

defect_parquet is a rather large table, but layer_parquet is a small table of several hundred kilobytes.

Query B is 80% faster than Query A, and when I look at how Spark explains the two queries, the plans differ. Here is the plan for Query A: [image not preserved]. Here is the plan for Query B: [image not preserved].

It looks like Spark handles them differently. Can anyone explain this to me? And why is Query B faster?

hadoop apache-spark apache-spark-sql

Jesse · 2017-06-08 15:13

1 answer

T. Gawęda · Accepted Answer · 2017-06-08T18:32:45+0000

I think the statistics tell everything:

Both versions use a broadcast join.
However, in the second query the subquery applies a projection (it selects only layer_scan_index), so the table being broadcast is much smaller. This results in a significantly smaller broadcast size and a shorter run time.
The first query tries to prefilter the large dataset, but the filter removes very little and the dataset is still large, so this optimization only slows the first query down.
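
If you want to check this on your side, you could print both plans with EXPLAIN and try a join that, like query B, broadcasts only the one column it needs. This is just a sketch: it assumes layer_scan_index is the column shared by both tables (query A joins on id, so adjust the key to whatever is correct for your schema).

```sql
-- Print the physical plan of query B; do the same for query A and
-- compare the size of the relation that ends up in the broadcast.
EXPLAIN
SELECT * FROM inspex.defect_parquet
WHERE layer_scan_index
IN    (SELECT layer_scan_index
      FROM inspex.layer_parquet
      WHERE name = 'Example1');

-- Join form of query B: the small side is filtered and projected to a
-- single column before the join. The join key is an assumption, see above.
SELECT a.*
FROM inspex.defect_parquet a
LEFT SEMI JOIN (SELECT layer_scan_index
                FROM inspex.layer_parquet
                WHERE name = 'Example1') b
ON a.layer_scan_index = b.layer_scan_index;
```

Note that the semi join, like query B, returns only the columns of defect_parquet, whereas query A with SELECT * returns the columns of both tables.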
