Why does Spark SQL translate the String "null" to the object null for Float / Double types?

I have a DataFrame containing float and double values.

scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]

scala> df.show
+---+---+
|  x|  y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+

scala> df.printSchema
root
 |-- x: float (nullable = false)
 |-- y: double (nullable = false)

      

To replace the NaN values with null, I passed "null" as a String in the Map used for the fill operation.

scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)

scala> df.na.fill(map).printSchema
root
 |-- x: float (nullable = true)
 |-- y: double (nullable = true)


scala> df.na.fill(map).show
+----+----+
|   x|   y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+

      

And I got the correct result. But I couldn't figure out how and why Spark SQL translates the String "null" into the object null.



3 answers


If you look at the function fill in Dataset, it checks the replacement value against the data type of each column in the schema and tries to convert it. If the value can be converted, the converted value is used; otherwise it returns null.

It does not convert the String "null" into the object null; it returns null because an exception occurs during the conversion.

val map = df.columns.map((_, "WHATEVER")).toMap

      

gives null



and val map = df.columns.map((_, "9999.99")).toMap

      

gives 9999.99

If you update the NaN values with a value of the same data type, you get the expected result.
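
For example, a minimal sketch (assuming the same df as in the question; 0.0 is just an arbitrary placeholder value) that fills with a numeric value instead of a String:

// The replacement value already matches the column types, so no String cast is involved.
val numericMap = df.columns.map((_, 0.0)).toMap
df.na.fill(numericMap).show()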

Hope this helps you understand!



I looked at the source code: fill casts your String to the data type of the double / float column:

private def fillCol[T](col: StructField, replacement: T): Column = {
    col.dataType match {
      case DoubleType | FloatType =>
        // nanvl turns NaN into null, and coalesce then substitutes the
        // replacement value cast to the column's data type
        coalesce(nanvl(df.col("`" + col.name + "`"), lit(null)),
          lit(replacement).cast(col.dataType)).as(col.name)
      case _ =>
        // non-floating-point columns: only nulls are replaced
        coalesce(df.col("`" + col.name + "`"), lit(replacement).cast(col.dataType)).as(col.name)
    }
  }

      

The relevant source code for the cast is in Cast.scala (taken from Spark 1.6.3; the code for Float is similar):

  // DoubleConverter
  private[this] def castToDouble(from: DataType): Any => Any = from match {
    case StringType =>
      buildCast[UTF8String](_, s => try s.toString.toDouble catch {
        case _: NumberFormatException => null
      })
    case BooleanType =>
      buildCast[Boolean](_, b => if (b) 1d else 0d)
    case DateType =>
      buildCast[Int](_, d => null)
    case TimestampType =>
      buildCast[Long](_, t => timestampToDouble(t))
    case x: NumericType =>
      b => x.numeric.asInstanceOf[Numeric[Any]].toDouble(b)
  }

      

So Spark tries to convert the String to a Double (s.toString.toDouble); if that is not possible (i.e. you get a NumberFormatException), you get null. So instead of "null" you could also use "foo" with the same result. But if you use "1.0" in your map, then the NaNs and nulls will be replaced by 1.0, because the String "1.0" can actually be parsed as a Double.
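
A quick way to see the same behaviour outside Spark is to mimic that try/catch in the Scala REPL; this is just an illustrative sketch, not the actual Spark code path:

// Same idea as the StringType branch of castToDouble: a successful parse
// yields the number, a failed parse yields null.
def castLikeSpark(s: String): Any =
  try s.toDouble catch { case _: NumberFormatException => null }

castLikeSpark("1.0")   // 1.0  -> NaN gets replaced by 1.0
castLikeSpark("null")  // null -> NaN gets replaced by null
castLikeSpark("foo")   // null -> same result as "null"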



It is not that "null", as a String, is converted to the object null. You can try the conversion with any String and still get null (except for Strings that can be cast directly to double / float, see below). For example, using

val map = df.columns.map((_, "abc")).toMap

      

will give the same result. I assume that, since the columns are of type float and double, converting a String value to them fails and gives null. Using a number instead will work as expected, e.g.

val map = df.columns.map((_, 1)).toMap

      

Since some Strings can be cast directly to double or float, such a String can also be used in this case:

val map = df.columns.map((_, "1")).toMap
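
Putting it together, a short sketch (reusing the df from the question) that applies either of the maps above:

// Numeric replacement: the value already matches the column type.
df.na.fill(df.columns.map((_, 1)).toMap).show()

// String replacement that parses as a number: the cast succeeds, same effect.
df.na.fill(df.columns.map((_, "1")).toMap).show()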

      







