Pyspark window function first_value
I am using PySpark 1.5, getting my data from Hive tables, and trying to use window functions.
According to this, there is an analytic function called first_value
which will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it anywhere in PySpark.
Is there a way to implement this, given that PySpark does not allow user-defined aggregate functions (UDAFs)?
Spark >= 2.0:
first takes an optional argument ignorenulls, which can simulate the behavior of first_value:
df.select(col("k"), first("v", True).over(w).alias("fv"))
Spark < 2.0:
The available function is named first and can be used as follows:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

w = Window().partitionBy("k").orderBy("v")
df.select(col("k"), first("v").over(w).alias("fv"))
but if you want to ignore nulls, you have to call Hive's first_value directly through SQL:
df.registerTempTable("df")
sqlContext.sql("""
SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
FROM df""")