Spark - Scala - Replace a value in a dataframe with a lookup value from another dataframe

I am working with Spark on Databricks, using the Scala programming language.

I have two data frames:

  • Main data frame: see screenshot 1
  • Lookup data frame: see screenshot 3

I would like to:

  • Find all rows where Age == -1 in the main data frame.
  • Look at the "Title" value of that row.
  • Look up, in the second data frame, the average age for people with that title.
  • Update the Age in the main data frame with that value.

I have racked my brain over how to do this. The only thing I came up with was storing the data as tables in Databricks and using SQL statements (sqlContext.sql ...), which turned out to be very complicated.

I am wondering if there is a more efficient way to do this.

Edit: adding a reproducible example

import org.apache.spark.sql.functions._
// Age == -1 marks a missing age
val df = sc.parallelize(Seq(("Fred", 20, "Intern"), ("Linda", -1, "Manager"), ("Sean", 23, "Junior Employee"),
    ("Walter", 35, "Manager"), ("Kate", -1, "Junior Employee"), ("Kathrin", 37, "Manager"),
    ("Bob", 16, "Intern"), ("Lukas", 24, "Junior Employee")))
  .toDF("Name", "Age", "Title")

println("Data Frame DF")
df.show()


// Average age per title, computed over the rows where the age is known
val avgAge = df.filter("Age != -1").groupBy("Title").agg(avg("Age").alias("avg_age"))
println("Average Ages")
avgAge.show()

println("Missing Age")
val noAge = df.filter("Age == -1")
noAge.show()

      

Solution thanks to Karol Sudol

// alias avg_age to Age so the columns line up with df for the union below
val imputedAges = df.filter("Age == -1").join(avgAge, Seq("Title"))
  .select(col("Name"), col("avg_age").alias("Age"), col("Title"))
imputedAges.show()

// union resolves columns by position: Name, Age, Title on both sides
val finalDF = imputedAges.union(df.filter("Age != -1"))
println("FinalDF")
finalDF.show()
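A small follow-up on the solution: avg("Age") produces a double, so the imputed ages come through as doubles in finalDF. If whole-number ages are preferred, the average can be rounded and cast back to an integer before the union; a minimal sketch of that variant:

val imputedAgesInt = df.filter("Age == -1")
  .join(avgAge, Seq("Title"))
  .select(col("Name"), round(col("avg_age")).cast("int").alias("Age"), col("Title"))

// same positional union as above, now with integer ages on both sides
val finalDFInt = imputedAgesInt.union(df.filter("Age != -1"))
finalDFInt.show()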

      

1 answer


val df = dfMain.filter("age == -1").join(dfLookUp, Seq("title")).select(col("title"), col("avg"), ......)

      

Use a left/right/outer join with the main DF in the next step if you want to keep any other values.
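Applied to the reproducible example in the question, that keep-everything variant could look roughly like this (a sketch using a left join plus when/otherwise, with df and avgAge as defined in the question's edit):

import org.apache.spark.sql.functions._

// left join keeps every row of df; avg_age is null for titles that have no known ages
val withAvg = df.join(avgAge, Seq("Title"), "left")

// substitute the per-title average only where the age is missing
val filled = withAvg
  .withColumn("Age", when(col("Age") === -1, col("avg_age")).otherwise(col("Age")))
  .drop("avg_age")

filled.show()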



Complete the tutorials: learning Databricks
