Pyspark window function first_value

I am using PySpark 1.5, getting my data from Hive tables, and trying to use window functions.

According to this, there is an analytic function called firstValue that will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it in PySpark anywhere.

Is there a way to implement this, given that PySpark does not allow UserDefinedAggregateFunctions (UDAFs)?



1 answer


Spark >= 2.0:

first takes an optional argument ignorenulls, which can simulate the behavior of first_value:

df.select(col("k"), first("v", True).over(w).alias("fv"))

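For completeness, here is a self-contained version of that call (the SparkSession setup and the sample data are my additions, mirroring the Spark < 2.0 example below):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, first
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", None), ("a", 1), ("a", -1), ("b", 3)], ["k", "v"]
)

w = Window.partitionBy("k").orderBy("v")

# ignorenulls=True skips nulls, mimicking Hive's first_value(v, TRUE).
# The default frame runs from the partition start to the current row,
# so a row whose frame contains only nulls still gets a null fv.
df.select(col("k"), first("v", ignorenulls=True).over(w).alias("fv")).show()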

Spark < 2.0:



The available function is named first and can be used as follows:

from pyspark.sql.functions import col, first
from pyspark.sql.window import Window

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

w = Window.partitionBy("k").orderBy("v")

df.select(col("k"), first("v").over(w).alias("fv"))

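A caveat worth spelling out (my note, not part of the original answer): with an ascending sort, nulls come first, so over the default frame first("v") just picks up that leading null:

df.select(col("k"), first("v").over(w).alias("fv")).show()
# every row of partition "a" reports fv = null, because the null row
# sorts first and is included in each row's frame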

but if you want to ignore nulls, you have to use the Hive UDF directly:

df.registerTempTable("df")

sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
    FROM df""")

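Since sqlContext.sql returns an ordinary DataFrame, this composes with the rest of a PySpark pipeline. A usage sketch (the AS fv alias and the expected values are my additions):

out = sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v) AS fv
    FROM df""")
out.show()
# rows of partition "a" after the leading null should report fv = -1;
# partition "b" reports fv = 3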
