Error when exploding an empty or null array in a Spark DataFrame

I have a JSON file with this type of schema:

{
 "name" : "john doe",
 "phone-numbers" : {
   "home": ["1111", "222"],
   "country" : "England" 
  }
}

The array of home phone numbers can sometimes be empty.
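For example, a record where that list is empty might look like this (the values here are made up for illustration):

{
 "name" : "jane doe",
 "phone-numbers" : {
   "home": [],
   "country" : "England"
  }
}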

My Spark app reads a list of these JSON files and does the following:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.select($"name", 
                               explode(dataframe.col("phone-numbers.home")))

When the "home" array is empty, I get the following error when I try to blow it up:

org.apache.spark.sql.AnalysisException: Unable to resolve 'phone-numbers['home']' due to data type mismatch: argument 2 requires an integral type, but 'home' is of string type.;;

Is there an elegant way to keep Spark from failing on this field when it is empty or null?

+3




2 answers


The problem is not empty arrays ("home": []) but arrays that are null ("home": null), which don't work with explode.

So, first filter out the null values:

val result = df
   .filter($"phone-numbers.home".isNotNull)
   .select($"name", explode($"phone-numbers.home"))

Or replace the null values with an empty array (which I would prefer in your situation):

val nullToEmptyArr = udf(
  // Spark passes array columns to UDFs as Seq; the home numbers are strings in the JSON
  (arr: Seq[String]) => if (arr == null) Seq.empty[String] else arr
)

val result = df
  .withColumn("phone-numbers.home", nullToEmptyArr($"phone-numbers.home")) // adds a top-level column whose name contains a dot
  .select($"name", explode($"`phone-numbers.home`")) // backticks reference that new column

+2




Spark has a class called DataFrameNaFunctions that is specialized for dealing with missing data in DataFrames.

This class has three main methods: drop, replace and fill.

To use them, call df.na, which returns a DataFrameNaFunctions for your df; then apply one of the three methods, which returns your df with the specified operation applied.
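A minimal sketch of the three methods (the column name and fill values here are just for illustration):

val noNulls  = df.na.drop()                                   // drop rows that contain any null value
val filled   = df.na.fill("unknown", Seq("name"))             // fill nulls in the "name" column with a default
val replaced = df.na.replace("name", Map("n/a" -> "unknown")) // replace specific values in "name"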



To solve your problem, you can use something like this:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.na.drop()
  .select($"name", explode(dataframe.col("phone-numbers.home")))

Hope this helps, best wishes.

+2

