How does Spark SQL optimize joins? What are the optimization tricks?

I am trying to understand how Spark 2.0 works with the DataFrame API. Since it is working with DataFrames, Spark has knowledge of the structure of the data.

  • When I join a large table with a small table, I understand that broadcasting the smaller table is a good idea.

  • However, when joining a large table with another large table, what kind of optimization tricks are there? Does sorting help? Or does Spark sort internally? When should I repartition the data?

Any explanations would help.

1 answer


DISCLAIMER: I'm still new to this area of join query optimization, so take it with a grain of salt.


Spark SQL comes with a JoinSelection execution planning strategy that translates a logical join into one of the supported physical join operators, based on each physical join operator's selection requirements.

There are 6 different types of physical join operators:

  • BroadcastHashJoinExec
    when the left or right side of the join can be broadcast (i.e. is smaller than spark.sql.autoBroadcastJoinThreshold, which defaults to 10M); see the sketch after this list

  • ShuffledHashJoinExec
    when spark.sql.join.preferSortMergeJoin is disabled and it is possible to build hash maps for the left or right side of the join (among other requirements)

  • SortMergeJoinExec
    when the left join keys are orderable

  • BroadcastNestedLoopJoinExec
    when there are no join keys and the left or right side of the join can be broadcast

  • CartesianProductExec
    when it is an inner or cross join with no join condition

  • BroadcastNestedLoopJoinExec
    when no other join operator matches

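To see which of these operators JoinSelection actually picks, you can print the physical plan. Here is a minimal sketch (the local session, table names, and sizes are made up for illustration; spark.sql.autoBroadcastJoinThreshold and the broadcast hint are real Spark APIs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical local session and toy tables, just to inspect the chosen join operator.
val spark = SparkSession.builder().master("local[*]").appName("join-selection").getOrCreate()
import spark.implicits._

val large = (1 to 1000000).map(i => (i, s"n$i")).toDF("id", "name")
val small = (1 to 100).map(i => (i, s"l$i")).toDF("id", "label")

// With the default threshold (10M), the small side is broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
large.join(small, "id").explain()   // look for BroadcastHashJoin in the physical plan

// Disabling automatic broadcasting makes JoinSelection fall back to another operator.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
large.join(small, "id").explain()   // with orderable keys, look for SortMergeJoin

// An explicit broadcast hint forces the broadcast regardless of the threshold.
large.join(broadcast(small), "id").explain()
```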
As you can see, there is a lot of theory to digest to answer "what kind of optimization tricks are there".

Does sorting help?



Yes. See the SortMergeJoinExec operator.

Or does Spark sort internally?

It will try, but humans can (still?) work wonders.
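To make that concrete, continuing the hypothetical session from the sketch above: when two large DataFrames are joined on an orderable key and broadcasting is off, the printed plan shows that Spark itself inserts the shuffle and per-partition sort that SortMergeJoinExec needs, so pre-sorting by hand is usually unnecessary.

```scala
// Continuing the hypothetical session above, with automatic broadcasting disabled.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

val left  = (1 to 500000).map(i => (i, s"l$i")).toDF("id", "l")
val right = (1 to 500000).map(i => (i, s"r$i")).toDF("id", "r")

// The physical plan should show a SortMergeJoin whose children are Sort nodes on top
// of Exchange (hash partitioning on "id") nodes: Spark adds both steps itself.
left.join(right, "id").explain()
```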

When should I repartition the data?

Always, if you can and you know that pruning can help. It can reduce the number of rows to process and effectively allow a BroadcastHashJoinExec over a ShuffledHashJoinExec, or the other joins, too.

I would also say that repartitioning the data can be of particular help in cost-based optimization, where pruning a table can reduce the number of columns and rows, and in turn the size of the table and the cost of one join relative to others in general.
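As a hedged illustration of that last point (the paths, column names, and the "active customers" scenario below are made up), pruning rows and columns before the join shrinks one side, which can push it under the broadcast threshold, and repartitioning by the join key can sometimes let Spark reuse that partitioning instead of shuffling again:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical inputs; paths and column names are invented for illustration.
val orders    = spark.read.parquet("/data/orders")
val customers = spark.read.parquet("/data/customers")

// Prune rows and columns as early as possible: a much smaller right side may now
// fit under spark.sql.autoBroadcastJoinThreshold and become a BroadcastHashJoin.
val activeCustomers = customers
  .where(col("active") === true)
  .select("customer_id", "segment")

orders.join(activeCustomers, "customer_id").explain()

// Repartitioning by the join key co-locates matching rows; depending on partition
// counts, Spark may be able to reuse this partitioning and avoid an extra Exchange
// when the same key is joined on repeatedly.
val ordersByKey = orders.repartition(col("customer_id"))
```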
