Scala / Spark: how to rank rows of a dataframe with SQLContext in version 1.5
I have sample data as shown below:
Input:
accountNumber assetValue
A100 1000
A100 500
B100 600
B100 200
Expected output:
accountNumber assetValue rank
A100 1000 1
A100 500 2
B100 600 1
B100 200 2
Now my question is: how do I add this rank column to a dataframe sorted by accountNumber? I don't expect a huge number of rows, so I am open to ideas that compute the rank outside of the dataframe if necessary.
I am on Spark version 1.5 with SQLContext, hence I cannot use window functions.
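Since the row count is small, the per-group ranking can also be done outside Spark entirely. A minimal sketch in plain Scala collections (the grouping key and sample values are taken from the question; the intermediate names are mine):

```scala
// Plain-Scala sketch of the desired ranking, assuming the data fits in memory.
// Rank = 1-based position within each accountNumber group, by descending assetValue.
val rows = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200))

val ranked = rows
  .groupBy(_._1)                                // partition by accountNumber
  .toSeq
  .flatMap { case (acct, group) =>
    group
      .sortBy(-_._2)                            // order by assetValue desc
      .zipWithIndex                             // 0-based position within the group
      .map { case ((_, value), i) => (acct, value, i + 1) } // 1-based rank
  }
  .sortBy(r => (r._1, r._3))                    // stable display order

ranked.foreach(println)
```

The result matches the expected output: (A100,1000,1), (A100,500,2), (B100,600,1), (B100,200,2). This could be applied via collect() on the driver, or per-partition on an RDD grouped by accountNumber.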
2 answers
Raw SQL (note that in Spark 1.5, window functions in SQL require a HiveContext rather than a plain SQLContext):
import sqlContext.implicits._

// Build the sample dataframe and register it as a temporary table
val df = sc.parallelize(Seq(
  ("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)
)).toDF("accountNumber", "assetValue")
df.registerTempTable("df")

// RANK() restarts at 1 within each accountNumber partition
sqlContext.sql("SELECT accountNumber, assetValue, RANK() OVER (PARTITION BY accountNumber ORDER BY assetValue DESC) AS rank FROM df").show
+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
| A100| 1000| 1|
| A100| 500| 2|
| B100| 600| 1|
| B100| 200| 2|
+-------------+----------+----+
You can use the row_number function together with a Window expression, on which you specify the partition and ordering columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import sqlContext.implicits._

val df = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)).toDF("accountNumber", "assetValue")

// row_number() restarts at 1 within each accountNumber partition
df.withColumn("rank", row_number().over(Window.partitionBy($"accountNumber").orderBy($"assetValue".desc))).show
+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
| A100| 1000| 1|
| A100| 500| 2|
| B100| 600| 1|
| B100| 200| 2|
+-------------+----------+----+