Spark from_json - StructType and ArrayType
I have a dataset that comes as XML, and one of its nodes contains JSON. Spark reads this as a StringType, so I am trying to use from_json() to convert the JSON to a DataFrame.
I can convert a plain JSON string, but how do I write a schema that works with an array?
String without an array - works fine
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schemaExample = new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")
val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))
dfICanWorkWith.collect()
// Results
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])
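As a side note (not part of the original question), once the struct is parsed you can give it an alias and flatten it into top-level columns; this is a sketch using the same `dfExample` and `schemaExample` as above:

```scala
// Alias the parsed struct, then select its nested fields as ordinary columns.
val parsed = dfExample.select(from_json($"theJson", schemaExample).as("person"))
parsed.select($"person.FirstName", $"person.Surname").show()
// yields one row with FirstName = Johnny, Surname = Boy
```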
String with an array - this variant does not work
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schemaExample2 = new StructType()
.add("", ArrayType(new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
)
)
val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")
val dfICanWorkWith2 = dfExample2.select(from_json($"theJson", schemaExample2))
dfICanWorkWith2.collect()
// Result
res22: Array[org.apache.spark.sql.Row] = Array([null])
2 answers
The problem is that your string is not fully valid JSON for the schema you defined. It is missing a few things:
- First, it is missing the surrounding {} that makes it a JSON object
- Second, it is missing the key for the array value (your schema names the field "", but the JSON string never includes that key)
- Finally, it is missing the closing ]
Try replacing it with:
val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")
and you get:
scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
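To go one step further (this is my addition, not part of the answer), you can explode the parsed array so that each person becomes its own row. Since the struct field is named "" in `schemaExample2`, it is easiest to access it with `getField`:

```scala
import org.apache.spark.sql.functions._

// Parse the JSON, explode the array held under the "" field,
// then pull the nested fields out as ordinary columns.
val parsed2 = dfExample2.select(from_json($"theJson", schemaExample2).as("wrapper"))
val people = parsed2
  .select(explode($"wrapper".getField("")).as("person"))
  .select($"person.FirstName", $"person.Surname")
people.show()
// yields two rows: (Johnny, Boy) and (Franky, Man)
```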
Since version 2.4, the schema_of_json function helps:
> SELECT schema_of_json('[{"col":0}]');
array<struct<col:int>>
In your case, you can use the code below to parse this array of JSON objects:
scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>' ) as theJson""").show(false)
+------------------------------+
|theJson                       |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+
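You can also use schema_of_json from the DataFrame API instead of SQL. A sketch (Spark 2.4+, names are illustrative): derive the schema from a sample of the data with from_json(Column, Column), which accepts the schema as a column expression:

```scala
import org.apache.spark.sql.functions._

// Infer the schema from one representative JSON sample, then apply it.
val sample = """[{ "FirstName":"Johnny", "Surname":"Boy" }]"""
val parsed = dfExample2.select(
  from_json($"theJson", schema_of_json(lit(sample))).as("theJson")
)
parsed.show(false)
```

Note that schema_of_json infers the schema from the sample you pass it, so the sample must cover all fields you expect to see in the real data.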