Can't filter DataFrame using Window function in Spark

I'm trying to use a window function based boolean expression to detect duplicate entries:

df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show


In Spark 2.1.1 this fails with:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate


On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:

df
.withColumn("count", count("*").over(Window.partitionBy($"col1",$"col2")))
.where($"count" > lit(1)).drop($"count")
.show


How can I write this without using a temporary column?





1 answer


Window functions cannot be used directly in a filter; you have to compute the value as an additional column and filter on that.

What you can do, though, is put the window function in the select list, so everything fits in one statement. For example, with lag:



val window = Window.partitionBy(col("1")).orderBy(col("2"))

df.select(col("1"), col("2"), lag(col("2"), 1).over(window).alias("2_lag"))
  .filter(col("2_lag") === col("2"))


Then you have it in one statement.
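Applied to the duplicate-count case from the question, the same pattern would look like this (a sketch; `cnt` is just an alias name chosen here, and the grouping columns are assumed to be `col1` and `col2` as in the question):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

df.select($"col1", $"col2",
    count("*").over(Window.partitionBy($"col1", $"col2")).alias("cnt"))
  .filter($"cnt" > 1)
  .drop("cnt")
  .show
```

The window count is materialized as a column inside the select, filtered, and dropped again, so no named temporary column survives outside the single chained expression.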









