Sort within partitions, but not across partitions, using Spark RDD
In Hadoop MapReduce, the default shuffle behavior is to sort the shuffle key within each partition, but not across partitions (it is the total ordering that makes keys sorted across partitions).
I would like to know how to achieve the same thing with a Spark RDD (sort within a partition, but not across partitions).
- The RDD method sortByKey does a total ordering.
- The RDD method repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra repartition step (see the sketch after this list).

Is there a direct way to sort within a partition, but not across partitions?
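For reference, a minimal sketch of the repartitionAndSortWithinPartitions call mentioned above (the pair RDD contents and the partition count are made up for illustration):

import org.apache.spark.HashPartitioner

// A pair RDD is required, since the method comes from OrderedRDDFunctions.
val pairs = sc.parallelize(Seq(("e", 1), ("d", 2), ("f", 3), ("b", 4), ("c", 5), ("a", 6)))

// The Partitioner argument is the extra repartition step: records are first
// shuffled into the partitioner's partitions, then sorted by key within each one.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))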
1 answer
You can use the Dataset method sortWithinPartitions:
import spark.implicits._
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show
+----+
|text|
+----+
| d|
| e|
| f|
| a|
| b|
| c|
+----+
In general, the shuffle machinery is an important factor when sorting within partitions, because the sort reuses the shuffle structures and does not have to load all of the data into memory at once.
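If the result has to stay an RDD, one possible alternative (an untested sketch, not necessarily the best approach) is to sort each partition locally with mapPartitions; unlike the shuffle-backed sort above, this keeps a whole partition in memory while sorting:

val sortedRdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  // Sort each partition's records locally; no shuffle, partitioning is preserved.
  .mapPartitions(iter => iter.toSeq.sorted.iterator, preservesPartitioning = true)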