Some(null) for nullable StringType: scala.MatchError

I have an RDD[(Seq[String], Seq[String])]

with some null values in the data. The RDD converted to a DataFrame looks like this:

+----------+----------+
|      col1|      col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+

Below is the sample code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import sqlContext.implicits._

val rdd = sc.parallelize(Seq((Seq("111", "aaa"), Seq("xx", null))))
val df = rdd.toDF("col1", "col2")
val keys = Array("col1", "col2")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))

val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))

val transposedDf = sqlContext.createDataFrame(transposed, schema)

transposedDf.show()

It works fine until I create transposedDf; however, as soon as I call show(), it throws the following error:

scala.MatchError: null
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)

If there are no null values in the RDD, the code works fine. I don't understand why this happens when I have null values, because I am specifying the StringType schema with nullable = true. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.



3 answers


Pattern matching proceeds linearly, in the order the cases appear in the source, so this line:

case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)

which places no restriction on the values of t1 and t2, always matches first, so the null case below it is never reached.

Effectively: put the null check first and it should work.
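For instance, a minimal sketch of that reordering (the same flatMap as in the question, with only the cases swapped):

val values = df.flatMap {
  case Row(_, null) => None // handle a null column before the general case
  case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
}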



The problem is that null, whether you like it or not, does not match the first pattern: t2: Seq[String] may, in theory, be null. While it is true that you could solve this immediately by simply putting the null pattern first, I find it imperative to use the Scala tools to get rid of null and avoid any nasty runtime surprises altogether.

So, you can do something like this:



def foo(s: Seq[String]): Option[Seq[String]] =
  if (s == null || s.contains(null)) None else Some(s) // guard the Seq itself as well as its elements
// or you could do fancy things with filter/filterNot

rdd.map {
  case (first, second) => (foo(first), foo(second))
}


This will give you the Some/None pairs that you seem to want, but I would look into flattening out the None values.
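For example, one hypothetical way to flatten them out, keeping only the rows where both sides survived the null check (the cleaned name is mine, not part of the original answer):

val cleaned = rdd.map {
  case (first, second) => (foo(first), foo(second))
}.flatMap {
  case (Some(a), Some(b)) => Some((a, b)) // both sides are null-free
  case _ => None // drop anything that contained a null
}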



I think you will need to encode the null values into an empty or special string before doing the operations above. Also keep in mind that Spark is lazy, so nothing in val values = df.flatMap ... actually executes until show() is called.
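A minimal sketch of that idea, assuming an empty string as the sentinel (the encodeNulls helper and the choice of "" are illustrative, any marker you can recognize later would do):

// Replace null elements with a sentinel before zipping, so no null
// ever reaches the Catalyst string converter.
def encodeNulls(s: Seq[String]): Seq[String] =
  s.map(x => if (x == null) "" else x)

val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[String]) =>
    Some((encodeNulls(t1) zip encodeNulls(t2)).toMap)
  case _ => None
}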







