How to sort on multiple columns using takeOrdered?
How can I sort by two or more columns using takeOrdered(4)(Ordering[Int])
in Spark with Scala?
I can achieve this using sortBy like this:
lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).map(p => println(p)).take(50)
But when I try to sort using the takeOrdered method, it fails.
tl;dr Do something like this (but consider rewriting the code so that split
is only called once):
lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)
Here's an explanation.
When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String], because lines is an RDD[String]. You need to convert lines to a new RDD[(Int, Int)]. Since an implicit Ordering[(Int, Int)] exists, it applies to the transformed RDD.
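To make the difference between the two implicit orderings concrete, here is a minimal sketch using plain Scala collections instead of an RDD, so it runs without Spark. The sample values and the object name are made up for illustration.

```scala
object ImplicitOrderings {
  // Made-up sample values, for illustration only.
  val asStrings: Seq[String] = Seq("9", "10", "2")
  val asPairs: Seq[(Int, Int)] = Seq((2, -5), (1, -3), (1, -7))

  // Ordering[String] is resolved implicitly and compares text,
  // so "10" sorts before "2".
  val sortedStrings: Seq[String] = asStrings.sorted

  // Ordering[(Int, Int)] is resolved implicitly and compares the
  // first element, then the second.
  val sortedPairs: Seq[(Int, Int)] = asPairs.sorted

  def main(args: Array[String]): Unit = {
    println(sortedStrings) // List(10, 2, 9)
    println(sortedPairs)   // List((1,-7), (1,-3), (2,-5))
  }
}
```

The same implicit resolution is what decides how takeOrdered compares elements: strings for an RDD[String], pairs for an RDD[(Int, Int)].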
Meanwhile, sortBy works a little differently. Here is the signature:
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
I know this is a daunting signature, but if you look past the noise, you will see that sortBy takes a function that maps your original type to a new type for sorting purposes only, and applies the Ordering for that return type if one is in implicit scope.
In your case, you apply a function to each String in your RDD to convert it to a "representation" of how Spark should treat it for sorting purposes only, namely an (Int, Int), and then rely on the implicit Ordering[(Int, Int)] being available, as mentioned above.
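Scala's standard collections have a sortBy of the same shape (a key function plus an implicit Ordering for the key type), so the idea can be tried without a cluster. This is only a sketch on a plain Seq, not on an RDD, and the CSV lines are made up for illustration.

```scala
object SortByKeepsElements {
  // Made-up CSV lines; columns 1 and 4 hold integers, as in the question.
  val lines: Seq[String] = Seq(
    "a,2,x,y,10",
    "b,1,x,y,5",
    "c,1,x,y,8"
  )

  // The (Int, Int) key is used only for comparison:
  // ascending on column 1, descending on column 4.
  // The result is still a Seq[String].
  val sorted: Seq[String] =
    lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt))

  def main(args: Array[String]): Unit =
    sorted.foreach(println)
}
```

The elements come back as the original strings, ordered by the derived key.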
The sortBy approach lets you keep lines as an RDD[String] and use the mapping only for sorting, while the takeOrdered approach works on a new RDD containing (Int, Int) pairs derived from the original lines. Which approach is more suitable depends on what you want to achieve.
On another note, you probably want to rewrite your code to split
your text only once.
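One way to do that is to move the parsing into a small helper that calls split a single time per invocation. The sketch below uses a plain Seq so it runs without Spark; the names sortKey and byTwoColumns and the sample lines are hypothetical, introduced here for illustration.

```scala
object SplitOnce {
  // Hypothetical helper: splits the line once and builds the sort key
  // (ascending on column 1, descending on column 4).
  def sortKey(line: String): (Int, Int) = {
    val cols = line.split(",")
    (cols(1).toInt, -cols(4).toInt)
  }

  // An explicit Ordering[String] derived from the key function.
  val byTwoColumns: Ordering[String] =
    Ordering.by[String, (Int, Int)](sortKey)

  // Made-up CSV lines, for illustration only.
  val lines: Seq[String] = Seq(
    "a,2,x,y,10",
    "b,1,x,y,5",
    "c,1,x,y,8"
  )

  // Sort with the explicit ordering and keep the first two lines.
  val top2: Seq[String] = lines.sorted(byTwoColumns).take(2)

  def main(args: Array[String]): Unit =
    top2.foreach(println)
}
```

Because takeOrdered takes its Ordering in a second parameter list, the same ordering could presumably be passed explicitly on an RDD[String], as in lines.takeOrdered(50)(SplitOnce.byTwoColumns), which would return the original strings rather than (Int, Int) pairs. Note that the key is still recomputed on every comparison; only the double split inside each key computation is avoided.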