Some(null) for nullable StringType: scala.MatchError

I have an RDD[(Seq[String], Seq[String])]

with some null values in the data. The RDD converted to a DataFrame looks like this:

+----------+----------+
|      col1|      col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+

Below is the sample code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import sqlContext.implicits._

val rdd = sc.parallelize(Seq((Seq("111", "aaa"), Seq("xx", null))))
val df = rdd.toDF("col1", "col2")
val keys = Array("col1", "col2")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))

val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))

val transposedDf = sqlContext.createDataFrame(transposed, schema)

transposedDf.show()

It works fine until I create transposedDf; however, as soon as I call show(), it throws the following error:

scala.MatchError: null
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)

If there are no null values in the RDD, the code works fine. I don't understand why this happens when I have null values, because I am specifying the StringType schema with nullable = true. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.



3 answers


Pattern matching proceeds linearly, in the order the cases appear in the source, so this line:

case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)

which places no restriction on the values of t1 and t2, always matches first, so the null case below it is never reached.

Effectively: put the null check first and it should work.
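For instance, a minimal sketch of that reordering (the same flatMap as in the question, with only the cases swapped):

val values = df.flatMap {
  case Row(_, null) => None // handle a null column before the general case
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
}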



The problem is that null, whether you like it or not, does not match the first pattern: t2: Seq[String] may, in theory, be null. While it is true that you could solve this immediately by simply putting the null pattern first, I find it imperative to use the Scala tools to get rid of null and avoid any nasty runtime surprises altogether.

So, you can do something like this:



def foo(s: Seq[String]): Option[Seq[String]] =
  if (s == null || s.contains(null)) None else Some(s) // guard the Seq itself as well as its elements
// or you could do fancy things with filter/filterNot

rdd.map {
  case (first, second) => (foo(first), foo(second))
}


This will give you the Some/None pairs that you seem to want, but I would look into flattening out the None values.
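For example, one hypothetical way to flatten them out, keeping only the rows where both sides survived the null check (the cleaned name is mine, not part of the original answer):

val cleaned = rdd.map {
  case (first, second) => (foo(first), foo(second))
}.flatMap {
  case (Some(a), Some(b)) => Some((a, b)) // both sides are null-free
  case _ => None // drop anything that contained a null
}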



I think you will need to encode the null values into an empty or special string before doing the operations above. Also keep in mind that Spark is lazy, so nothing in val values = df.flatMap ... actually executes until show() is called.
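A minimal sketch of that idea, assuming an empty string as the sentinel (the encodeNulls helper and the choice of "" are illustrative, any marker you can recognize later would do):

// Replace null elements with a sentinel before zipping, so no null
// ever reaches the Catalyst string converter.
def encodeNulls(s: Seq[String]): Seq[String] =
  s.map(x => if (x == null) "" else x)

val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[String]) =>
    Some((encodeNulls(t1) zip encodeNulls(t2)).toMap)
  case _ => None
}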







