org.apache.spark.sql.AnalysisException: Unable to resolve given input columns
exitTotalDF
.filter($"accid" === "dc215673-ef22-4d59-0998-455b82000015")
.groupBy("exiturl")
.agg(first("accid"), first("segment"), $"exiturl", sum("session"), sum("sessionfirst"), first("date"))
.orderBy(desc("session"))
.take(500)
org.apache.spark.sql.AnalysisException: cannot resolve '`session`' given input columns: [first(accid, false), first(date, false), sum(session), exiturl, sum(sessionfirst), first(segment, false)]
It seems like the orderBy cannot resolve the column name produced by the sum aggregate.
Using Spark 2.1.
Typically, in scenarios like this, I use the as method on the column. For example: .agg(first("accid"), first("segment"), $"exiturl", sum("session").as("session"), sum("sessionfirst"), first("date")). This gives you more control over what to expect, and if the auto-generated name of the sum column were ever to change in a future version of Spark, you would have less of a headache updating all the names in your dataset.
Also, I just ran a simple test: when you don't provide an alias, the column name in Spark 2.1 becomes "sum(session)". One way to discover this is to call printSchema on the Dataset.
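Applied to the query from the question, a minimal sketch could look like the following. It assumes the same exitTotalDF with columns accid, segment, exiturl, session, sessionfirst, and date; I also dropped $"exiturl" from the agg list, since groupBy("exiturl") already keeps that column in the output.

```scala
import org.apache.spark.sql.functions.{first, sum, desc}
import spark.implicits._ // for the $"..." column syntax

val result = exitTotalDF
  .filter($"accid" === "dc215673-ef22-4d59-0998-455b82000015")
  .groupBy("exiturl")
  .agg(
    first("accid").as("accid"),
    first("segment").as("segment"),
    sum("session").as("session"),           // alias so later clauses can resolve it
    sum("sessionfirst").as("sessionfirst"),
    first("date").as("date"))
  .orderBy(desc("session"))                 // now resolves the aliased column
  .take(500)
```

With the alias in place, orderBy(desc("session")) refers to a column that actually exists in the aggregated schema, instead of the generated name "sum(session)".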
I prefer to use withColumnRenamed() instead of as(), because:
With as() you need to spell out every column you want in the select:
df.select(first("accid"),
first("segment"),
$"exiturl",
col("sum(session)").as("session"),
sum("sessionfirst"),
first("date"))
versus withColumnRenamed, which is a one-liner:
val df1 = df.withColumnRenamed("sum(session)", "session")
The resulting df1 will contain all the columns that df has, except that the sum(session) column is now renamed to "session".
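A short sketch of this approach, assuming df is the already-aggregated DataFrame from the question (so it contains the auto-generated column named "sum(session)" observed in Spark 2.1):

```scala
import org.apache.spark.sql.functions.desc

// Rename the auto-generated aggregate column after the fact,
// leaving every other column of df untouched.
val df1 = df.withColumnRenamed("sum(session)", "session")

df1.printSchema()                       // the column now appears as "session"
val top = df1.orderBy(desc("session"))  // sorting by the renamed column resolves
  .take(500)
```

Note that withColumnRenamed is a no-op if no column with the given name exists, so a typo in the generated name fails silently; calling printSchema first is a cheap way to confirm the exact name.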