SortByValue for RDD tuples
I was recently asked (in a class assignment) to find the 10 most frequently occurring words in an RDD. I submitted a working solution that looks like this:
wordsRdd
.map(x => (x, 1))
.reduceByKey(_ + _)
.map { case (x, y) => (y, x) }
.sortByKey(false)
.map { case (x, y) => (y, x) }
.take(10)
So basically, I swap the key and the value, sort by key, and then swap them back before taking the first 10. I don't find the re-swapping very elegant, so I'm wondering if there is a more elegant way to do this.
I have searched and found some people using Scala implicits to convert the RDD to a Scala sequence and then do sortByValue, but I don't want to convert the RDD to a Scala Seq because it will kill the distributed nature of the RDD.
So is there a better way?
How about this:
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(Ordering.by(-1 * _._2))
or a little more verbose:
object WordCountPairsOrdering extends Ordering[(String, Int)] {
  // The arguments are flipped (b before a) so that larger counts sort first.
  def compare(a: (String, Int), b: (String, Int)): Int = b._2.compare(a._2)
}
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(WordCountPairsOrdering)
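To see why either ordering gives the most frequent words first, here is a minimal sketch on a plain local Seq rather than an RDD. The sample data and the names OrderingSketch, byNegatedCount, byCountDesc and topN are made up for illustration. Sorting and taking the first n is only a local stand-in for the result takeOrdered returns.

```scala
// Local stand-in for the (word, count) pairs produced by reduceByKey.
// The sample data below is hypothetical, for illustration only.
object OrderingSketch {
  val counts: Seq[(String, Int)] =
    Seq(("spark", 3), ("rdd", 7), ("scala", 5), ("sort", 1))

  // Compare pairs by negated count, so the largest count sorts first.
  val byNegatedCount: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](p => -p._2)

  // Same effect without negation: reverse the natural Int ordering
  // and apply it to the count field.
  val byCountDesc: Ordering[(String, Int)] =
    Ordering[Int].reverse.on[(String, Int)](_._2)

  // The n smallest elements under ord, in sorted order; this is the
  // result takeOrdered(n)(ord) is documented to return for an RDD.
  def topN(n: Int, ord: Ordering[(String, Int)]): Seq[(String, Int)] =
    counts.sorted(ord).take(n)
}
```

Both orderings put ("rdd", 7) ahead of ("scala", 5) here. On the RDD itself, top(10)(Ordering.by(_._2)) should express the same thing without negating, since top returns the largest elements under the given ordering. If the compiler cannot infer the tuple type there, spelling it out as in the sketch is one way around it.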