Adding a Constant Struct Column to a Spark DataFrame

I want to load a struct from a database collection and attach it as a constant column to every row of a target DataFrame.

I can load the column I want as a single-row DataFrame and then crossJoin it onto every row of the target:

import org.apache.spark.sql.functions.broadcast

// Broadcast the single-row side so the cross join doesn't shuffle the child.
val parentCollectionDF = /* ... load a single row from the database */
val constantCol = broadcast(parentCollectionDF.select("my_column"))
val result = childCollectionDF.crossJoin(constantCol)

It works, but it seems wasteful: the data is identical for every row of the child collection, yet crossJoin physically copies it onto each one.

If I could hard-code the values, I could use something like childCollectionDF.withColumn("my_column", struct(lit(val1) as "field1", lit(val2) as "field2" /* etc. */)).

But I don't know them ahead of time; I need to load a structure from a parent collection.
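
Spelled out, that hard-coded version would be something like this (a minimal sketch; the field names and values are placeholders, not my real schema):

import org.apache.spark.sql.functions.{lit, struct}

// Struct column built from literals known at compile time.
// "field1"/"field2" and the values are made up for illustration.
val hardcoded = childCollectionDF.withColumn("my_column",
  struct(lit("val1") as "field1", lit(42) as "field2"))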

What I'm looking for is something like:

childCollectionDF.withColumn("my_column",
  lit(parentCollectionDF.select("my_column").head().getStruct(0)))

... but I can see from the code for literals that lit() accepts only base types as an argument; passing a GenericRowWithSchema or a case class is not supported.
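
The least-bad workaround I can see is to fetch the Row on the driver and rebuild the struct field by field, but it requires knowing the field names ahead of time (a sketch, assuming base-type fields named "field1" and "field2"):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, struct}

// Pull the single parent row (and its nested struct) onto the driver.
val parentRow: Row = parentCollectionDF.select("my_column").head().getStruct(0)

// Re-create the struct column from literals, one field at a time.
// Only works for base-type fields, since lit() rejects nested Rows.
val rebuilt = childCollectionDF.withColumn("my_column",
  struct(
    lit(parentRow.getAs[Any]("field1")) as "field1",
    lit(parentRow.getAs[Any]("field2")) as "field2"))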

Is there a less clumsy way to do this? (Spark 2.1.1, Scala)

[edit: Not the same as this question, which explains how to add a struct of literal (hard-coded) constants. My struct has to be loaded dynamically.]
