SortByValue for RDD tuples

Question

I was recently asked (in a class assignment) to find the 10 most frequently occurring words in an RDD. I submitted a working solution that looks like this:

wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }
  .sortByKey(false)
  .map { case (x, y) => (y, x) }
  .take(10)

So basically, I swap each pair so that the count becomes the key, sort by key in descending order, and then swap back before taking the first 10. I don't find the re-swapping very elegant.

So, I'm wondering if there is a more elegant way to do this.

I have searched and found some people using Scala implicits to convert the RDD to a Scala sequence and then do sortByValue, but I don't want to convert the RDD to a Scala Seq because that would kill the distributed nature of the RDD.

So is there a better way?

scala apache-spark rdd

Knows not much 23 june 15 at 18:59

1 answer

zero323 · Accepted Answer · 2015-06-23T19:18:24+0000

How about this:

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(Ordering.by(-1 * _._2))

or a little more verbose:

object WordCountPairsOrdering extends Ordering[(String, Int)] {
    def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(WordCountPairsOrdering)
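
Why this avoids the swap: takeOrdered(n) returns the n smallest elements under the ordering it is given, so an ordering that compares the counts in reverse puts the most frequent words first. Here is a minimal sketch of that idea on a plain Scala collection, so it runs without Spark. The object name, the sample data, and takeOrderedLocal are made up for illustration; sorted(ord).take(n) is only a local stand-in for takeOrdered(n)(ord), not how Spark implements it.

```scala
object OrderingSketch {
  // Hypothetical sample data standing in for the output of reduceByKey(_ + _).
  val counts: Seq[(String, Int)] =
    Seq(("spark", 3), ("rdd", 7), ("scala", 5), ("tuple", 1))

  // Compare pairs by their count, largest first. For positive counts this
  // orders pairs the same way as Ordering.by(-1 * _._2).
  val byCountDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  // Local stand-in for rdd.takeOrdered(n)(byCountDesc):
  // the n smallest elements under the given ordering.
  def takeOrderedLocal(n: Int): Seq[(String, Int)] =
    counts.sorted(byCountDesc).take(n)

  def main(args: Array[String]): Unit =
    takeOrderedLocal(2).foreach(println) // prints (rdd,7) then (scala,5)
}
```

On an RDD the same ordering should be usable as takeOrdered(10)(byCountDesc). Two other RDD methods that also avoid the swap are top(10)(Ordering.by[(String, Int), Int](_._2)), which returns the largest elements directly, and sortBy(_._2, ascending = false).take(10), though the latter sorts the whole RDD first.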
