org.apache.spark.sql.AnalysisException: Unable to resolve given input columns

exitTotalDF
  .filter($"accid" === "dc215673-ef22-4d59-0998-455b82000015")
  .groupBy("exiturl")
  .agg(first("accid"), first("segment"), $"exiturl", sum("session"), sum("sessionfirst"), first("date"))
  .orderBy(desc("session"))
  .take(500)

org.apache.spark.sql.AnalysisException: cannot resolve '`session`' given input columns: [first(accid, false), first(date, false), sum(session), exiturl, sum(sessionfirst), first(segment, false)]

It's as if the sum function cannot name the columns correctly.

Using Spark 2.1

+4




3 answers


Typically, in scenarios like this, I use the as method on the column. For example:

    .agg(first("accid"), first("segment"), $"exiturl", sum("session").as("session"), sum("sessionfirst"), first("date"))

This gives you more control over what to expect, and if the generated name for the sum were ever to change in future versions of Spark, you have less of a headache updating all the names in your dataset.



Also, I just ran a simple test: when you don't provide a name, in Spark 2.1 the column name becomes "sum(session)". One easy way to find this is to call printSchema on the dataset.
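
To illustrate, here is a minimal, self-contained sketch of the aliasing approach. The rows are made-up test data; only the column names come from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical test data using the question's column names.
    val exitTotalDF = Seq(
      ("acc1", "segA", "http://a", 3L, 1L, "2017-01-01"),
      ("acc1", "segA", "http://a", 2L, 0L, "2017-01-02")
    ).toDF("accid", "segment", "exiturl", "session", "sessionfirst", "date")

    val result = exitTotalDF
      .groupBy("exiturl") // the grouping column is kept in the output automatically
      .agg(first("accid"), first("segment"),
        sum("session").as("session"), // explicit alias
        sum("sessionfirst"), first("date"))
      .orderBy(desc("session")) // now resolves against the alias

    result.printSchema() // shows "session" rather than "sum(session)"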

+6




I prefer to use withColumnRenamed() instead of as(), because:

With as() you need to spell out all the columns you want to keep:

    df.select(first("accid"),
          first("segment"),
          $"exiturl",
          col("sum(session)").as("session"), // the auto-generated column name
          sum("sessionfirst"),
          first("date"))



Versus withColumnRenamed(), which is a one-liner:

    val df1 = df.withColumnRenamed("sum(session)", "session")

The output df1 will contain all the columns that df has, except that the sum(session) column has now been renamed to "session".
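
For comparison, a sketch of this approach, reusing the hypothetical exitTotalDF from the first answer:

    // Aggregate without an alias; Spark 2.1 names the result column "sum(session)".
    val aggregated = exitTotalDF
      .groupBy("exiturl")
      .agg(first("accid"), first("segment"), sum("session"),
        sum("sessionfirst"), first("date"))

    // Rename just that one column; every other column passes through unchanged.
    val df1 = aggregated.withColumnRenamed("sum(session)", "session")
    df1.orderBy(desc("session")).show()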

+3




From Spark 2.0 on, the spark-shell starts with Hive support enabled by default. We can disable Hive support using the command below:

spark-shell --conf spark.sql.catalogImplementation=in-memory
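
The same setting can also be applied when building a SparkSession programmatically; a sketch, assuming it is set before the session is first created (spark.sql.catalogImplementation is a static configuration):

    import org.apache.spark.sql.SparkSession

    // Equivalent to the --conf flag above: use the in-memory catalog
    // instead of the Hive metastore.
    val spark = SparkSession.builder()
      .config("spark.sql.catalogImplementation", "in-memory")
      .getOrCreate()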


0

