How does Spark SQL optimize joins? What are the optimization tricks?
I am trying to understand how Spark 2.0 works with the DataFrame API. Being a DataFrame, Spark has knowledge of the structure of the data.
When I join a large table to a small table, I understand that broadcasting the smaller table is a good idea.
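A minimal sketch of what I mean, assuming hypothetical Parquet paths and a join column id:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

// Hypothetical tables; the paths and join column "id" are illustrative only.
val large = spark.read.parquet("/data/large_table")
val small = spark.read.parquet("/data/small_table")

// The broadcast hint ships the small side to every executor, so the join
// can run as a broadcast hash join without shuffling the large side.
large.join(broadcast(small), Seq("id")).explain()
```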
However, when joining a large table to a large table, what kind of optimization tricks are there? Does sorting help? Or would Spark do the sorting internally? When should I repartition the data?
Any explanations would help
DISCLAIMER: I'm still new to this area of join query optimization, so take it with a grain of salt.
Spark SQL comes with a JoinSelection execution planning strategy that translates a logical join into one of the supported physical join operators (per each physical join operator's selection requirements).
There are 6 different types of physical join operators:
- BroadcastHashJoinExec when the left or right join side can be broadcast (i.e. is smaller than spark.sql.autoBroadcastJoinThreshold, which is 10M by default)
- ShuffledHashJoinExec when spark.sql.join.preferSortMergeJoin is disabled and it is possible to build hash maps for the left or right join side (among other requirements)
- SortMergeJoinExec when the left join keys are "orderable"
- BroadcastNestedLoopJoinExec when there are no join keys and the left or right join side can be broadcast
- CartesianProductExec when it is an inner or cross join with no join condition
- BroadcastNestedLoopJoinExec when no other operator matches
As you can see, there is a lot of theory to digest for "what kind of optimization tricks are there".
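A quick way to see which of these operators the planner picked is to inspect the physical plan. A minimal sketch with two hypothetical DataFrames joined on id:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("join-plan").getOrCreate()
import spark.implicits._

// Hypothetical DataFrames; the data and column names are illustrative only.
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val df2 = Seq((1, "x"), (2, "y")).toDF("id", "v2")

// explain() prints the physical plan; the chosen join operator
// (e.g. BroadcastHashJoin or SortMergeJoin) appears there by name.
df1.join(df2, "id").explain()
```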
Does sorting help?
Yes. See the SortMergeJoinExec operator.
Or would Spark do the sorting internally?
It will try, but humans can (still?) do wonders.
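For instance (continuing the sketch above), disabling broadcast joins makes the planner fall back to a sort-merge join for an equi-join, and Spark inserts the required sorts itself:

```scala
// Reusing df1 and df2 from the earlier sketch.
// Setting the threshold to -1 disables broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// The physical plan should now show SortMergeJoin preceded by Sort
// operators that Spark added on both sides; no manual sorting was needed.
df1.join(df2, "id").explain()
```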
When should I repartition the data?
Whenever you can and you know that pruning can help. Pruning can reduce the number of rows to process and effectively allow a BroadcastHashJoinExec over a ShuffledHashJoinExec or the others, as in the sketch below.
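A hedged sketch of that idea, with hypothetical tables and columns: filter and project the smaller side before the join, so it may fall under the broadcast threshold on its own:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pruning").getOrCreate()
import spark.implicits._

// Hypothetical tables and columns, purely for illustration.
val orders = spark.read.parquet("/data/orders")
val customers = spark.read.parquet("/data/customers")

// Prune rows and columns before the join: the pruned side may now fit
// under spark.sql.autoBroadcastJoinThreshold and be broadcast automatically.
val activeCustomers = customers
  .filter($"active" === true)     // row pruning
  .select("customer_id", "name")  // column pruning

orders.join(activeCustomers, "customer_id").explain()
```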
I also think that repartitioning the data can be of particular help with cost-based optimization, where pruning a table can reduce the number of columns and rows, and in turn the table size and the cost of one join over the others in general.