Error when exploding an empty or null array in a Spark DataFrame

I have a JSON file with this type of schema:

{
 "name" : "john doe",
 "phone-numbers" : {
   "home": ["1111", "222"],
   "country" : "England" 
  }
}

The array of home phone numbers can sometimes be empty.
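For example, a record where that list is empty might look like this (the values here are made up for illustration):

{
 "name" : "jane doe",
 "phone-numbers" : {
   "home": [],
   "country" : "England"
  }
}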

My Spark app reads a list of these JSON files and does the following:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.select($"name", 
                               explode(dataframe.col("phone-numbers.home")))

When the "home" array is empty, I get the following error when I try to blow it up:

org.apache.spark.sql.AnalysisException: Unable to resolve 'phone-numbers['home']' due to data type mismatch: argument 2 requires an integral type, but 'home' is of string type.;;

Is there an elegant way to keep Spark from failing on this field when it is empty or null?

+3




2 answers


The problem is not empty arrays ("home": []) but arrays that are null ("home": null), which don't work with explode.

So, first filter out the null values:

val result = df
   .filter($"phone-numbers.home".isNotNull)
   .select($"name", explode($"phone-numbers.home"))

Or replace the null values with an empty array (which I would prefer in your situation):

val nullToEmptyArr = udf(
  // Spark passes array columns to UDFs as Seq; the home numbers are strings in the JSON
  (arr: Seq[String]) => if (arr == null) Seq.empty[String] else arr
)

val result = df
  .withColumn("phone-numbers.home", nullToEmptyArr($"phone-numbers.home")) // adds a top-level column whose name contains a dot
  .select($"name", explode($"`phone-numbers.home`")) // backticks reference that new column

+2




Spark has a class called DataFrameNaFunctions that is specialized for dealing with missing data in DataFrames.

This class has three main methods: drop, replace and fill.

To use them, call df.na, which returns a DataFrameNaFunctions for your df; then apply one of the three methods, which returns your df with the specified operation applied.
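A minimal sketch of the three methods (the column name and fill values here are just for illustration):

val noNulls  = df.na.drop()                                   // drop rows that contain any null value
val filled   = df.na.fill("unknown", Seq("name"))             // fill nulls in the "name" column with a default
val replaced = df.na.replace("name", Map("n/a" -> "unknown")) // replace specific values in "name"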



To solve your problem, you can use something like this:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.na.drop()
  .select($"name", explode(dataframe.col("phone-numbers.home")))

Hope this helps, best wishes.

+2

