Sort within a partition, but not across partitions, using Spark RDD

Hadoop MapReduce's default shuffle behavior is to sort the shuffle key within each partition, but not across partitions (sorting keys across partitions would be a total order).

I would like to know how to achieve the same with a Spark RDD (sorting within a partition, but not across partitions):

  • The RDD method sortByKey

    does a total (full) ordering.
  • The RDD method repartitionAndSortWithinPartitions

    sorts within each partition but not across partitions; unfortunately, it adds an extra repartitioning step (see the sketch after this list).
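
For reference, here is a minimal sketch of the repartitionAndSortWithinPartitions approach (the pair RDD and its contents are hypothetical; note that an explicit Partitioner has to be supplied, which is what causes the extra shuffle):

import org.apache.spark.HashPartitioner

// Hypothetical (key, value) pair RDD; the data is for illustration only.
val pairs = sc.parallelize(
  Seq(("e", 1), ("d", 2), ("f", 3), ("b", 4), ("c", 5), ("a", 6)))

// An explicit Partitioner is required, so the records are first
// reshuffled into new partitions and only then sorted by key.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// Inspect the per-partition ordering.
sorted.glom().collect()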

Is there a direct way to sort within a partition, but not across partitions?



1 answer


You can use the Dataset method sortWithinPartitions:

import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle machinery is an important part of partition-level sorting because it reuses the shuffle structures to sort without loading all the data into memory at once.
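
If the end result needs to stay an RDD, one possible workaround (a minimal sketch, assuming a SparkSession named spark, a SparkContext sc, and the default column name value that toDS gives a Dataset[String]) is to round-trip through the Dataset API:

import spark.implicits._

// Convert the RDD to a Dataset, sort each partition locally, and go back to an RDD.
// The Dataset keeps the RDD's partitioning, and sortWithinPartitions does not shuffle.
val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

val sortedRdd = rdd.toDS()
  .sortWithinPartitions($"value")
  .rdd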
