How to sort on multiple columns using takeOrdered?

Question

How can I sort by two or more columns using takeOrdered(4)(Ordering[Int]) in Spark with Scala?

I can achieve this using sortBy like this:

lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).take(50).foreach(println)

But when I try to sort using the takeOrdered method, it fails.

scala apache-spark

AJm Apr 22. 17 at 8:53

2 answers

Vidya · Answer 1 · 2017-04-22T17:09:57+0000

tl;dr Do something like this (but consider rewriting the code so that split is called only once):

lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)

Here's an explanation.

When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String], because lines is an RDD[String]. You need to transform lines into a new RDD[(Int, Int)]. Since an implicit Ordering[(Int, Int)] exists, it takes effect on the transformed RDD.

Meanwhile, sortBy works a little differently. Here is the signature:

sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

I know this is a daunting signature, but if you get past the noise, you will see that sortBy takes a function that maps your original type to a new type for sorting purposes only, and applies the Ordering for that return type if one is in implicit scope.

In your case, you apply a function to the Strings in your RDD to convert them to a "representation" of how Spark should treat them for sorting purposes only, namely as (Int, Int), and then rely on the implicit Ordering[(Int, Int)] that is available, as mentioned.

The sortBy approach lets you keep lines intact as an RDD[String] and use the mapping only for sorting, while the takeOrdered approach works with a new RDD containing (Int, Int) pairs derived from the original lines. Which approach is more suitable depends on what you want to achieve.

On another note, you probably want to rewrite your code to split your text only once.
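One way to split only once is to move the parsing into a small helper that returns the sort key. The sketch below uses a plain Scala Seq with made-up sample rows so it runs without a cluster; the names SplitOnce, sortKey and the sample data are assumptions for illustration, not part of the original code.

```scala
object SplitOnce {
  // Hypothetical helper: split the line a single time and build the
  // (column 1 ascending, column 4 descending) key from the fields.
  def sortKey(line: String): (Int, Int) = {
    val fields = line.split(",")
    (fields(1).toInt, -fields(4).toInt)
  }

  // Made-up sample rows standing in for the contents of the RDD.
  val lines: Seq[String] = Seq("a,10,x,y,5", "b,9,x,y,7", "c,10,x,y,8")

  // Local stand-in for lines.map(sortKey).takeOrdered(2) on an RDD:
  // sort the keys with the implicit Ordering[(Int, Int)] and keep two.
  val top2: Seq[(Int, Int)] = lines.map(sortKey).sorted.take(2)
}
```

With an actual RDD[String], the same helper should plug into either approach: lines.map(sortKey).takeOrdered(50) or lines.sortBy(sortKey).take(50).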

Harald · Answer 2 · 2017-04-22T10:06:55+0000

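You can implement your own Ordering:

lines.takeOrdered(4)(new Ordering[String] {
  override def compare(x: String, y: String): Int = {
    val xs = x.split(",")
    val ys = y.split(",")
    // Column 1 ascending; Integer.compare avoids the overflow that subtraction can cause
    val d1 = Integer.compare(xs(1).toInt, ys(1).toInt)
    // Ties are broken by column 4, descending (arguments swapped)
    if (d1 != 0) d1 else Integer.compare(ys(4).toInt, xs(4).toInt)
  }
})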
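A shorter variant of the same idea is to derive the Ordering[String] from a tuple key with Ordering.by, which reuses the implicit Ordering[(Int, Int)] instead of a hand-written compare. This is a minimal sketch on a local Seq with made-up rows; ByColumns, byColumns and the sample data are illustrative names, not from the answers above.

```scala
object ByColumns {
  // Ordering on the raw lines: column 1 ascending, then column 4
  // descending (achieved by negating the second key component).
  val byColumns: Ordering[String] = Ordering.by[String, (Int, Int)] { line =>
    val f = line.split(",")
    (f(1).toInt, -f(4).toInt)
  }

  // Made-up sample rows standing in for the contents of the RDD.
  val lines: Seq[String] = Seq("a,10,x,y,5", "b,9,x,y,7", "c,10,x,y,8")

  // Local stand-in for lines.takeOrdered(2)(byColumns) on an RDD[String].
  val top2: Seq[String] = lines.sorted(byColumns).take(2)
}
```

On an RDD[String] this ordering can be passed explicitly, as in lines.takeOrdered(4)(ByColumns.byColumns), and the result keeps the original strings rather than the keys. Note that the line is split again on every comparison, so for large inputs parsing once up front may be preferable.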
