Can't filter DataFrame using Window function in Spark

I'm trying to use a window function based boolean expression to detect duplicate entries:

df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show


In Spark 2.1.1 this fails with:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate


On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:

df
.withColumn("count", count("*").over(Window.partitionBy($"col1",$"col2")))
.where($"count" > lit(1)).drop($"count")
.show


How can I write this without using a temporary column?





1 answer


Window functions cannot be used directly in a filter; you have to compute the value as an additional column and filter on that.

What you can do, though, is put the window function in the select list, so everything fits in one statement. For example, with lag:



val window = Window.partitionBy(col("1")).orderBy(col("2"))

df.select(col("1"), col("2"), lag(col("2"), 1).over(window).alias("2_lag"))
  .filter(col("2_lag") === col("2"))


Then you have it in one statement.
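Applied to the duplicate-count case from the question, the same pattern would look like this (a sketch; `cnt` is just an alias name chosen here, and the grouping columns are assumed to be `col1` and `col2` as in the question):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

df.select($"col1", $"col2",
    count("*").over(Window.partitionBy($"col1", $"col2")).alias("cnt"))
  .filter($"cnt" > 1)
  .drop("cnt")
  .show
```

The window count is materialized as a column inside the select, filtered, and dropped again, so no named temporary column survives outside the single chained expression.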









