scala.MatchError: null with a nullable StringType schema
I have an RDD[(Seq[String], Seq[String])]
with some null values in the data. Converted to a DataFrame, the RDD looks like this:
+----------+----------+
| col1| col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+
Below is a code sample:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import sqlContext.implicits._

val rdd = sc.parallelize(Seq((Seq("111", "aaa"), Seq("xx", null))))
val df = rdd.toDF("col1", "col2")
val keys = Array("col1", "col2")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
val transposed = values.map(someFunc(keys)) // someFunc builds a Row per record (definition not shown)
val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))
val transposedDf = sqlContext.createDataFrame(transposed, schema)
transposedDf.show()
It works fine up to the point where I create transposedDf; however, as soon as I call show() on it, it throws the following error:
scala.MatchError: null
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
If there are no null values in the RDD, the code works fine. I don't understand why this happens when I have null values, because I am setting nullable = true for the StringType fields in the schema. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.
The problem is the null in your data: t2: Seq[String] could, in theory, be null itself, and, as in your sample, it can also contain null elements, which sail straight through the first pattern. While it is true that you can solve the immediate error by simply putting the null pattern first, I find it better to use Scala's tools to get rid of null entirely and avoid any nasty runtime surprises altogether.
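For illustration, the "null pattern first" quick fix would look something like this (a sketch against the question's code; note that it only guards against a column that is null as a whole, not against null elements inside the Seq, which the approach below handles):

val values = df.flatMap {
  // handle null columns before the typed patterns so they cannot fall through
  case Row(_, null) | Row(null, _) => None
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
}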
So, you can do something like this:
def foo(s: Seq[String]) = if (s.contains(null)) None else Some(s)
// or you could do fancy things with filter/filterNot

// map over the original RDD; a DataFrame would hand you Rows, not tuples
val tagged = rdd.map {
  case (first, second) => (foo(first), foo(second))
}
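Run against the sample data from the question, this yields one Option per column (tagged is just a name introduced here for the result):

tagged.collect().foreach(println)
// prints: (Some(List(111, aaa)),None)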
This will give you the Some/None tags that you seem to want, but I would also take a look at flattening out those Nones.
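For example, one way to flatten them out is to keep only the pairs where both sides survive the foo check above, dropping the rest (a sketch; whole records are discarded when either Seq contains null):

val cleaned = rdd.flatMap {
  case (first, second) =>
    // the for-comprehension yields a pair only when both Options are defined
    for (f <- foo(first); s <- foo(second)) yield (f, s)
}
// cleaned: RDD[(Seq[String], Seq[String])] with the null-containing records removed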