Pyspark window function first_value
I am using PySpark 1.5, getting my data from Hive tables, and trying to use window functions.
According to this, there is an analytic function called first_value
which will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it anywhere in PySpark.
Is there a way to implement this, given that PySpark does not allow user-defined aggregate functions (UDAFs)?
Spark >= 2.0:
first takes an optional argument ignorenulls, which can simulate the behavior of first_value:
df.select(col("k"), first("v", True).over(w).alias("fv"))
Spark < 2.0:
The available function is named first and can be used as follows:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

w = Window().partitionBy("k").orderBy("v")
df.select(col("k"), first("v").over(w).alias("fv"))
but if you want to ignore nulls, you have to call Hive's first_value directly through SQL:
df.registerTempTable("df")
sqlContext.sql("""
SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
FROM df""")