Why does Spark SQL translate String "null" to Object null for Float / Double types?
I have a DataFrame containing float and double values.
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
| x| y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> df.printSchema
root
|-- x: float (nullable = false)
|-- y: double (nullable = false)
When I replaced the NaN values with null, I passed "null" as a String in the Map used by the fill operation.
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
| x| y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
And I got the value I wanted. But I couldn't figure out how and why Spark SQL translates the String "null" into the object null.
If you look at the fill function on Dataset (in DataFrameNaFunctions), it checks the data type and tries to cast the replacement value to the data type of the column in the schema. If the value can be cast, it is cast; otherwise it returns null.
It does not convert the String "null" to the object null; it returns null because an exception is thrown during the conversion.
val map = df.columns.map((_, "WHATEVER")).toMap
gives null
and val map = df.columns.map((_, "9999.99")).toMap
gives 9999.99
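For example, applying both maps to the df from the question makes the difference visible (a small sketch; the comments describe the result the cast behaviour implies, not captured shell output):
scala> df.na.fill(df.columns.map((_, "WHATEVER")).toMap).show   // NaN rows come back as null
scala> df.na.fill(df.columns.map((_, "9999.99")).toMap).show    // NaN rows come back as 9999.99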
If you replace the NaN values with a value of the same data type, you get the result you expect.
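For instance, with the df from the question, a numeric replacement fills the NaN rows directly (a minimal sketch using the fill(Double) and per-column Map overloads of DataFrameNaFunctions):
scala> df.na.fill(0.0).show                          // replaces NaN (and null) in numeric columns with 0.0
scala> df.na.fill(Map("x" -> 0f, "y" -> 0d)).show    // or per column, with values of the matching type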
Hope this helps you understand!
I looked at the source code: fill casts your String to the type of the double/float column:
private def fillCol[T](col: StructField, replacement: T): Column = {
  col.dataType match {
    case DoubleType | FloatType =>
      coalesce(nanvl(df.col("`" + col.name + "`"), lit(null)),
        lit(replacement).cast(col.dataType)).as(col.name)
    case _ =>
      coalesce(df.col("`" + col.name + "`"), lit(replacement).cast(col.dataType)).as(col.name)
  }
}
The relevant source code for the cast is in Cast.scala (taken from Spark 1.6.3; the code for Float is similar):
// DoubleConverter
private[this] def castToDouble(from: DataType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, s => try s.toString.toDouble catch {
      case _: NumberFormatException => null
    })
  case BooleanType =>
    buildCast[Boolean](_, b => if (b) 1d else 0d)
  case DateType =>
    buildCast[Int](_, d => null)
  case TimestampType =>
    buildCast[Long](_, t => timestampToDouble(t))
  case x: NumericType =>
    b => x.numeric.asInstanceOf[Numeric[Any]].toDouble(b)
}
So Spark tries to convert the String to a Double (s.toString.toDouble); if that is not possible (i.e. you get a NumberFormatException), you get null. So instead of "null" you could just as well use "foo" with the same result. But if you use "1.0" in your map, then the NaNs and nulls will be replaced by 1.0, because the String "1.0" can actually be parsed as a Double.
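You can see that cast in isolation, without going through na.fill (a small sketch; I'd expect the first column to come back null and the second 1.0):
scala> import org.apache.spark.sql.functions.lit
scala> import org.apache.spark.sql.types.DoubleType
scala> df.select(lit("null").cast(DoubleType).as("a"), lit("1.0").cast(DoubleType).as("b")).show(1)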
It is not that "null", as a String, is converted to the object null. You can try the conversion with any String and still get null (except for Strings that can be cast directly to double/float, see below). For example, using
val map = df.columns.map((_, "abc")).toMap
will give the same result. I assume that, since the columns are of type float and double, converting the String value to those types gives null. Using a number instead will work as expected, e.g.
val map = df.columns.map((_, 1)).toMap
Since some Strings can be cast directly to double or float, such a String can be used in this case as well:
val map = df.columns.map((_, "1")).toMap
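Applied to the df from the question, both of those maps should fill the NaN row with 1.0 (sketch; expected results noted in the comments):
scala> df.na.fill(df.columns.map((_, 1)).toMap).show     // Int 1 -> NaN becomes 1.0
scala> df.na.fill(df.columns.map((_, "1")).toMap).show   // "1" casts cleanly to double/float, same result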