How do I map a structure in a DataFrame to a case class?

Question

At some point in my application, I have a DataFrame with a Struct field generated from a case class. Now I want to display / map it back to the case class:

import spark.implicits._
case class Location(lat: Double, lon: Double)

scala> Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
res25: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<lat: double, lon: double>]

scala> res25.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

The basic approach:

res25.map(r => {
   Location(r.getStruct(1).getDouble(0), r.getStruct(1).getDouble(1))
}).show(1)

It looks really messy. Is there an easier way?

scala apache-spark spark-dataframe apache-spark-2.0

Atais 08 Apr 17 at 14:38

3 answers

You can also use the extractor pattern on Row, which gives similar results in more idiomatic Scala:

scala> import org.apache.spark.sql.Row

scala> res25.map { row =>
  (row: @unchecked) match {
    case Row(a: Int, Row(b: Double, c: Double)) => (a, Location(b, c))
  }
}
res26: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]
scala> res26.collect()
res27: Array[(Int, Location)] = Array((10,Location(35.0,25.0)), (20,Location(45.0,35.0)))

jamborta 08 Apr 17 at 15:40

I think the other answers have nailed it, but they might need a different wording.

In short, you cannot work with case classes directly in a DataFrame: its rows are Row objects, not case classes, and it uses RowEncoder to map internal SQL types to Row.

As the other answers said, you have to convert the Row-based DataFrame to a Dataset using the as operator.

val df = Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
scala> val ds = df.as[(Int, Location)]
ds: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]

scala> ds.show
+---+-----------+
| _1|         _2|
+---+-----------+
| 10|[35.0,25.0]|
| 20|[45.0,35.0]|
+---+-----------+

scala> ds.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

scala> ds.map[TAB pressed twice]

def map[U](func: org.apache.spark.api.java.function.MapFunction[(Int, Location),U],encoder: org.apache.spark.sql.Encoder[U]): org.apache.spark.sql.Dataset[U]
def map[U](func: ((Int, Location)) => U)(implicit evidence$6: org.apache.spark.sql.Encoder[U]): org.apache.spark.sql.Dataset[U]
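The second map overload listed above takes a plain Scala function, so once the DataFrame has been converted with as, the tuple can be destructured directly instead of going through Row getters. A minimal sketch, assuming a local SparkSession; the names TypedMapSketch and locations are made up for illustration and do not come from the answer:

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Same case class as in the question. It is declared at top level so that
// spark.implicits._ can derive an Encoder for it.
case class Location(lat: Double, lon: Double)

object TypedMapSketch {
  // Hypothetical helper: drops the Int key and keeps only the Location.
  // The Encoder is supplied by the caller, typically via spark.implicits._
  def locations(ds: Dataset[(Int, Location)])(implicit enc: Encoder[Location]): Dataset[Location] =
    ds.map { case (_, loc) => loc }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation.
    // In spark-shell, `spark` already exists and this line is not needed.
    val spark = SparkSession.builder().master("local[*]").appName("typed-map-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
    val ds = df.as[(Int, Location)] // typed view over the same data

    locations(ds).collect().foreach(println)
    spark.stop()
  }
}
```

The function body works on (Int, Location) values, so a mismatch between the schema and the case class is reported when as is applied rather than inside the map.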

Jacek Laskowski 09 Apr 17 at 19:08

himanshuIIITian · Accepted Answer · 2017-04-08T15:04:25+0000

In Spark 1.6+, if you want to retain type information, use a Dataset (DS) rather than a DataFrame (DF).

import spark.implicits._
case class Location(lat: Double, lon: Double)

scala> Seq((10, Location(35, 25)), (20, Location(45, 35))).toDS
res25: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]

scala> res25.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

This gives you a Dataset[(Int, Location)]. Now, if you want to get back to the original case class, just map over the tuple:

scala> res25.map(r => r._2).show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+

But if you want to stick with the DataFrame API because of its dynamic type, then you need to code it like this:

scala> res25.select("_2.*").map(r => Location(r.getDouble(0), r.getDouble(1))).show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+
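As a possible variation on the DataFrame route, the flattened struct columns can be bound to the case class by name with as[Location] rather than read by position, since the column names lat and lon match the case class fields. A minimal sketch, assuming a local SparkSession; StructToCaseClassSketch and toLocations are hypothetical names used only for this example:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}

// Same case class as in the question, declared at top level so that
// spark.implicits._ can derive an Encoder for it.
case class Location(lat: Double, lon: Double)

object StructToCaseClassSketch {
  // Hypothetical helper: expands the struct column _2 into its fields
  // (lat, lon) and binds them to Location by column name.
  def toLocations(df: DataFrame)(implicit enc: Encoder[Location]): Dataset[Location] =
    df.select("_2.*").as[Location]

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation.
    // In spark-shell, `spark` already exists and this line is not needed.
    val spark = SparkSession.builder().master("local[*]").appName("struct-to-case-class").getOrCreate()
    import spark.implicits._

    val df = Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
    toLocations(df).show()
    spark.stop()
  }
}
```

Binding by name avoids the positional getDouble(0) / getDouble(1) calls, so the mapping does not depend on the order of the fields in the struct.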
