How does Spark SQL optimize joins? What are the optimization tricks?

I am trying to understand how Spark 2.0 works with the DataFrame API. Since it is working with DataFrames, Spark has knowledge of the structure of the data.

  • When I join a large table with a small table, I understand that broadcasting the smaller table is a good idea.

  • However, when joining a large table with another large table, what kind of optimization tricks are there? Does sorting help? Or does Spark sort internally? When should I repartition the data?

Any explanations would help.

1 answer


DISCLAIMER: I'm still new to this area of join query optimization, so take it with a grain of salt.


Spark SQL comes with a JoinSelection execution planning strategy that translates a logical join into one of the supported physical join operators, based on each physical join operator's selection requirements.

There are 6 different types of physical join operators:

  • BroadcastHashJoinExec
    when the left or right side of the join can be broadcast (i.e. is smaller than spark.sql.autoBroadcastJoinThreshold, which defaults to 10M); see the sketch after this list

  • ShuffledHashJoinExec
    when spark.sql.join.preferSortMergeJoin is disabled and it is possible to build hash maps for the left or right side of the join (among other requirements)

  • SortMergeJoinExec
    when the left join keys are orderable

  • BroadcastNestedLoopJoinExec
    when there are no join keys and the left or right side of the join can be broadcast

  • CartesianProductExec
    when it is an inner or cross join with no join condition

  • BroadcastNestedLoopJoinExec
    when no other join operator matches

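To see which of these operators JoinSelection actually picks, you can print the physical plan. Here is a minimal sketch (the local session, table names, and sizes are made up for illustration; spark.sql.autoBroadcastJoinThreshold and the broadcast hint are real Spark APIs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical local session and toy tables, just to inspect the chosen join operator.
val spark = SparkSession.builder().master("local[*]").appName("join-selection").getOrCreate()
import spark.implicits._

val large = (1 to 1000000).map(i => (i, s"n$i")).toDF("id", "name")
val small = (1 to 100).map(i => (i, s"l$i")).toDF("id", "label")

// With the default threshold (10M), the small side is broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
large.join(small, "id").explain()   // look for BroadcastHashJoin in the physical plan

// Disabling automatic broadcasting makes JoinSelection fall back to another operator.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
large.join(small, "id").explain()   // with orderable keys, look for SortMergeJoin

// An explicit broadcast hint forces the broadcast regardless of the threshold.
large.join(broadcast(small), "id").explain()
```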
As you can see, there is a lot of theory to digest to answer "what kind of optimization tricks are there".

Does sorting help?



Yes. See the SortMergeJoinExec operator.

Or does Spark sort internally?

It will try, but humans can (still?) work wonders.
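To make that concrete, continuing the hypothetical session from the sketch above: when two large DataFrames are joined on an orderable key and broadcasting is off, the printed plan shows that Spark itself inserts the shuffle and per-partition sort that SortMergeJoinExec needs, so pre-sorting by hand is usually unnecessary.

```scala
// Continuing the hypothetical session above, with automatic broadcasting disabled.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

val left  = (1 to 500000).map(i => (i, s"l$i")).toDF("id", "l")
val right = (1 to 500000).map(i => (i, s"r$i")).toDF("id", "r")

// The physical plan should show a SortMergeJoin whose children are Sort nodes on top
// of Exchange (hash partitioning on "id") nodes: Spark adds both steps itself.
left.join(right, "id").explain()
```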

When should I repartition the data?

Always, if you can and you know that pruning can help. It can reduce the number of rows to process and effectively allow a BroadcastHashJoinExec over a ShuffledHashJoinExec, or the other joins, too.

I would also say that repartitioning the data can be of particular help in cost-based optimization, where pruning a table can reduce the number of columns and rows, and in turn the size of the table and the cost of one join relative to others in general.
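As a hedged illustration of that last point (the paths, column names, and the "active customers" scenario below are made up), pruning rows and columns before the join shrinks one side, which can push it under the broadcast threshold, and repartitioning by the join key can sometimes let Spark reuse that partitioning instead of shuffling again:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical inputs; paths and column names are invented for illustration.
val orders    = spark.read.parquet("/data/orders")
val customers = spark.read.parquet("/data/customers")

// Prune rows and columns as early as possible: a much smaller right side may now
// fit under spark.sql.autoBroadcastJoinThreshold and become a BroadcastHashJoin.
val activeCustomers = customers
  .where(col("active") === true)
  .select("customer_id", "segment")

orders.join(activeCustomers, "customer_id").explain()

// Repartitioning by the join key co-locates matching rows; depending on partition
// counts, Spark may be able to reuse this partitioning and avoid an extra Exchange
// when the same key is joined on repeatedly.
val ordersByKey = orders.repartition(col("customer_id"))
```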
