How do I map a structure in a DataFrame to a case class?

At some point in my application, I have a DataFrame with a Struct field generated from a case class. Now I want to display / map it back to the case class:

import spark.implicits._
case class Location(lat: Double, lon: Double)

scala> Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
res25: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<lat: double, lon: double>]

scala> res25.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

      

And basic:

res25.map(r => {
   Location(r.getStruct(1).getDouble(0), r.getStruct(1).getDouble(1))
}).show(1)

      

It looks really messy. Is there an easier way?

+3


source to share


3 answers


In Spark 1.6+, if you want to store stored type information, use Dataset (DS), not DataFrame (DF).

import spark.implicits._
case class Location(lat: Double, lon: Double)

scala> Seq((10, Location(35, 25)), (20, Location(45, 35))).toDS
res25: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]

scala> res25.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

      

He will give you Dataset[(Int, Location)]

. Now, if you want to go back to this origin of the case class again, just follow these steps:



scala> res25.map(r => r._2).show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+

      

But if you want to stick with the DataFrame API because of its dynamic type, then you need to code it like this:

scala> res25.select("_2.*").map(r => Location(r.getDouble(0), r.getDouble(1))).show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+

      

+2


source


You can also use the extractor pattern in Row

, which will give you similar results using more idiomatic scala:



scala> res25.map { row =>
  (row: @unchecked) match {
    case Row(a: Int, Row(b: Double, c: Double)) => (a, Location(b, c))
  }
}
res26: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]
scala> res26.collect()
res27: Array[(Int, Location)] = Array((10,Location(35.0,25.0)), (20,Location(45.0,35.0)))

      

0


source


I think the other answers have nailed it, but they might need a different wording.

In short, it is not possible to use case classes in DataFrames, as they are not case classes and are used RowEncoder

to map internal SQL types to Row

.

As other answers said, you have to convert Row

based on DataFrame

to Dataset

using operator as

.

val df = Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
scala> val ds = df.as[(Int, Location)]
ds: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]

scala> ds.show
+---+-----------+
| _1|         _2|
+---+-----------+
| 10|[35.0,25.0]|
| 20|[45.0,35.0]|
+---+-----------+

scala> ds.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

scala> ds.map[TAB pressed twice]

def map[U](func: org.apache.spark.api.java.function.MapFunction[(Int, Location),U],encoder: org.apache.spark.sql.Encoder[U]): org.apache.spark.sql.Dataset[U]
def map[U](func: ((Int, Location)) => U)(implicit evidence$6: org.apache.spark.sql.Encoder[U]): org.apache.spark.sql.Dataset[U]

      

0


source







All Articles