Sort within a partition, but not across partitions, using Spark RDD

Hadoop MapReduce's default shuffle behavior is to sort the shuffle key within each partition, but not across partitions (sorting keys across partitions would be a total order).

I would like to know how to achieve the same with a Spark RDD (sorting within a partition, but not across partitions):

  • The RDD method sortByKey

    does a total (full) ordering.
  • The RDD method repartitionAndSortWithinPartitions

    sorts within each partition but not across partitions; unfortunately, it adds an extra repartitioning step (see the sketch after this list).
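
For reference, here is a minimal sketch of the repartitionAndSortWithinPartitions approach (the pair RDD and its contents are hypothetical; note that an explicit Partitioner has to be supplied, which is what causes the extra shuffle):

import org.apache.spark.HashPartitioner

// Hypothetical (key, value) pair RDD; the data is for illustration only.
val pairs = sc.parallelize(
  Seq(("e", 1), ("d", 2), ("f", 3), ("b", 4), ("c", 5), ("a", 6)))

// An explicit Partitioner is required, so the records are first
// reshuffled into new partitions and only then sorted by key.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// Inspect the per-partition ordering.
sorted.glom().collect()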

Is there a direct way to sort within a partition, but not across partitions?



1 answer


You can use the Dataset method sortWithinPartitions:

import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle machinery is an important part of partition-level sorting because it reuses the shuffle structures to sort without loading all the data into memory at once.
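
If the end result needs to stay an RDD, one possible workaround (a minimal sketch, assuming a SparkSession named spark, a SparkContext sc, and the default column name value that toDS gives a Dataset[String]) is to round-trip through the Dataset API:

import spark.implicits._

// Convert the RDD to a Dataset, sort each partition locally, and go back to an RDD.
// The Dataset keeps the RDD's partitioning, and sortWithinPartitions does not shuffle.
val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

val sortedRdd = rdd.toDS()
  .sortWithinPartitions($"value")
  .rdd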
