SortByValue for RDD tuples

I was recently asked (in a class assignment) to find the 10 most frequently occurring words in an RDD. I submitted a working solution that looks like this:

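wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }  // swap to (count, word) so the count is the key
  .sortByKey(false)               // sort by count, descending
  .map { case (x, y) => (y, x) }  // swap back to (word, count)
  .take(10)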

So basically, I swap the key and the value, sort by key, and then swap them back before taking the first 10. I don't find the double swap very elegant.

So, I'm wondering if there is a more elegant way to do this.

I have searched and found some people using Scala implicits to convert the RDD to a Scala sequence and then do sortByValue, but I don't want to convert the RDD to a Scala Seq because it will kill the distributed nature of the RDD.

So is there a better way?

1 answer


How about this:

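// takeOrdered returns the smallest elements under the given Ordering,
// so negating the count puts the 10 highest counts first.
wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(Ordering.by(-1 * _._2))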



Or, a little more verbosely, with a named Ordering:

object WordCountPairsOrdering extends Ordering[(String, Int)] {
    // Compare b to a (not a to b) so that larger counts sort first.
    def compare(a: (String, Int), b: (String, Int)): Int = b._2.compare(a._2)
}

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(WordCountPairsOrdering)
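
To see why a reversed comparison gives the highest counts, here is a minimal sketch that applies the same kind of ordering to a plain local Seq instead of an RDD. It is only meant to illustrate the ordering, not to replace the distributed version; the names OrderingDemo and DescendingByCount and the sample counts are made up for this example.

```scala
object OrderingDemo {
  // Same comparison as the named Ordering in the answer: larger counts sort first.
  object DescendingByCount extends Ordering[(String, Int)] {
    def compare(a: (String, Int), b: (String, Int)): Int = b._2.compare(a._2)
  }

  // Hypothetical sample data standing in for the output of reduceByKey.
  val counts: Seq[(String, Int)] =
    Seq(("spark", 3), ("rdd", 1), ("scala", 5), ("sort", 2))

  // On a local Seq, sorted(...).take(n) selects the same elements that
  // takeOrdered(n)(...) would select from an RDD with this Ordering.
  val top2: Seq[(String, Int)] = counts.sorted(DescendingByCount).take(2)

  def main(args: Array[String]): Unit =
    println(top2) // prints List((scala,5), (spark,3))
}
```

Because takeOrdered only needs the first n elements under the ordering, it should not have to sort the whole RDD the way the sortByKey version does.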

      
