How do I get the last row from a DataFrame?

Question

I have a DataFrame with two columns, "value" and "timestamp". The "timestamp" column is ordered. I want to get the last row of the DataFrame. What should I do?

this is my input:

+-----+---------+
|value|timestamp|
+-----+---------+
|    1|        1|
|    4|        2|
|    3|        3|
|    2|        4|
|    5|        5|
|    7|        6|
|    3|        7|
|    5|        8|
|    4|        9|
|   18|       10|
+-----+---------+

this is my code:

    val arr = Array((1,1),(4,2),(3,3),(2,4),(5,5),(7,6),(3,7),(5,8),(4,9),(18,10))
    var df=m_sparkCtx.parallelize(arr).toDF("value","timestamp")

this is my expected output:

+-----+---------+
|value|timestamp|
+-----+---------+
|   18|       10|
+-----+---------+

+3

scala apache-spark apache-spark-sql spark-dataframe

mentongwu 31 jul. 17 at 2:42

5 answers

If your timestamp column is unique and in ascending order, there are the following ways to get the last row:

println(df.sort($"timestamp".desc).first())

// Output [18,10]

df.sort($"timestamp".desc).take(1).foreach(println)

// Output [18,10]

// Works only when the timestamps are exactly 1..n, as in the sample data
df.where($"timestamp" === df.count()).show

Output:

+-----+---------+
|value|timestamp|
+-----+---------+
|   18|       10|
+-----+---------+

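If not, create a new column with an index and select the row with the last index, as below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df1 = spark.sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  StructType(df.schema.fields :+ StructField("index", LongType, false)))

// zipWithIndex is 0-based, so the last row has index count - 1
df1.where($"index" === (df.count() - 1)).drop("index").show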

Output:

+-----+---------+
|value|timestamp|
+-----+---------+
|   18|       10|
+-----+---------+

+1

Shankar koirala 31 jul. 17 at 2:58

The most efficient way is to reduce your DataFrame. This gives you a single row, which you can convert back to a DataFrame, although that makes little sense for just one record.

sparkContext.parallelize(
  Seq(
  df.reduce {
    (a, b) => if (a.getAs[Int]("timestamp") > b.getAs[Int]("timestamp")) a else b 
   } match {case Row(value:Int,timestamp:Int) => (value,timestamp)}
  )
)
.toDF("value","timestamp")
.show


+-----+---------+
|value|timestamp|
+-----+---------+
|   18|       10|
+-----+---------+

Less efficient (since it requires a shuffle), but shorter:

df
.where($"timestamp" === df.groupBy().agg(max($"timestamp")).map(_.getInt(0)).collect.head)
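
The filter-on-maximum idea above can be sketched as a small standalone program. This is only an illustration: the object name `LastRowByMax`, the helper `lastRows`, and the local `SparkSession` settings are assumptions, not part of the original answer. It assumes a non-empty DataFrame with an integer "timestamp" column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object LastRowByMax {
  // Hypothetical helper: keeps the row(s) whose timestamp equals the column maximum.
  // Assumes df is non-empty and "timestamp" is an Int column.
  def lastRows(df: DataFrame): DataFrame = {
    // First job: aggregate the maximum timestamp and bring it to the driver.
    val maxTs = df.agg(max(col("timestamp"))).head().getInt(0)
    // Second job: filter on that value. If timestamps are not unique, all ties are kept.
    df.where(col("timestamp") === maxTs)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("last-row-max").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 1), (4, 2), (3, 3), (2, 4), (5, 5), (7, 6), (3, 7), (5, 8), (4, 9), (18, 10))
      .toDF("value", "timestamp")

    lastRows(df).show()
    spark.stop()
  }
}
```

Unlike the reduce version, the result here stays a DataFrame, at the cost of running two jobs.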

+1

Raphael Roth 31 jul. 17 at 5:49 am

I would simply use a query that orders the table in descending order and takes the first row of that ordering:

df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM table_df ORDER BY `timestamp` DESC LIMIT 1"""
latest_rec = self.sqlContext.sql(query_latest_rec)
latest_rec.show()

0

Danylo Zherebetskyy 22 Feb 18 at 17:50

Try this; it works for me:

df.orderBy($"timestamp".desc).show(1)
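
Note that `show` only prints. If the last row is needed as a one-row DataFrame (the shape of the expected output in the question), a sort on the timestamp column followed by `limit(1)` is one option. The sketch below is a minimal illustration; the object name `LastRowByOrder`, the helper `lastRow`, and the local session settings are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LastRowByOrder {
  // Hypothetical helper: sorts by timestamp descending and keeps the first row.
  // Returns an empty DataFrame if the input is empty.
  def lastRow(df: DataFrame): DataFrame =
    df.orderBy(col("timestamp").desc).limit(1)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("last-row-order").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 1), (4, 2), (3, 3), (2, 4), (5, 5), (7, 6), (3, 7), (5, 8), (4, 9), (18, 10))
      .toDF("value", "timestamp")

    lastRow(df).show()
    spark.stop()
  }
}
```

If timestamps are not unique, which of the tied rows comes back is not guaranteed.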

0

Mimii cheng 14 Mar '18 at 7:00

user8371915 · Accepted Answer · 2017-07-31T05:14:28+0000

I would just reduce:

df.reduce { (x, y) => 
  if (x.getAs[Int]("timestamp") > y.getAs[Int]("timestamp")) x else y 
}
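
For completeness, here is the same reduce wrapped in a small standalone program, using the sample data from the question. The object name `LastRowByReduce`, the helper `lastRow`, and the local session settings are assumptions made for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object LastRowByReduce {
  // Hypothetical helper: pairwise keeps whichever row has the larger timestamp.
  // reduce throws on an empty DataFrame, so the input is assumed to be non-empty.
  def lastRow(df: DataFrame): Row =
    df.reduce { (x, y) =>
      if (x.getAs[Int]("timestamp") > y.getAs[Int]("timestamp")) x else y
    }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("last-row-reduce").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 1), (4, 2), (3, 3), (2, 4), (5, 5), (7, 6), (3, 7), (5, 8), (4, 9), (18, 10))
      .toDF("value", "timestamp")

    println(lastRow(df)) // [18,10]
    spark.stop()
  }
}
```

The result is a `Row` on the driver, not a DataFrame, and the comparison does not depend on the order in which rows are stored.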
