Applying a function to an entire Spark column

I wrote the code below. My first question is how to use cast to change a column's datatype: how can I cast an entire column of the dataset (the value column) the same way I handle the timestamp column? My second question is how to apply the avg function to every column except the time column. Many thanks.

val df = spark.read.option("header", true).option("inferSchema", "true")
  .csv("C:/Users/mhattabi/Desktop/dataTest.csv")
val result = df.withColumn("new_time",
  ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp"))
result("value").cast("float") // here the first question
val finalresult = result.groupBy("new_time").agg(avg("value")).sort("new_time") // here the second question, about avg
finalresult.coalesce(1).write.format("com.databricks.spark.csv")
  .option("header", "true").save("C:/mydata.csv")

      



1 answer


This is pretty easy to do in PySpark, but I'd have to think about how to rewrite it in Scala ... I hope you can translate it somehow.



from pyspark.sql.functions import *
# Sample DataFrame standing in for the question's data
df = spark.createDataFrame([(100, "4.5", "5.6")], ["new_time", "col1", "col2"])
# Cast every column except new_time to float
columns = [col(c).cast('float') if c != 'new_time' else col(c) for c in df.columns]
# Build an avg() aggregate for every column except new_time
aggs = [avg(c) for c in df.columns if c != 'new_time']
finalresult = df.select(columns).groupBy('new_time').agg(*aggs)
finalresult.explain()

*HashAggregate(keys=[new_time#0L], functions=[avg(cast(col1#14 as double)), avg(cast(col2#15 as double))])
+- Exchange hashpartitioning(new_time#0L, 200)
   +- *HashAggregate(keys=[new_time#0L], functions=[partial_avg(cast(col1#14 as double)), partial_avg(cast(col2#15 as double))])
      +- *Project [new_time#0L, cast(col1#1 as float) AS col1#14, cast(col2#2 as float) AS col2#15]
         +- Scan ExistingRDD[new_time#0L,col1#1,col2#2]
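
Since the question is in Scala, here is a rough, untested sketch of the same idea translated to Scala. It reuses the question's column names (it assumes the original time column is called time) and averages every other column:

import org.apache.spark.sql.functions._

val df = spark.read.option("header", true).option("inferSchema", "true")
  .csv("C:/Users/mhattabi/Desktop/dataTest.csv")

// Bucket the timestamps into 5-minute (300-second) windows, as in the question
val result = df.withColumn("new_time",
  ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp"))

// Cast every column except the time columns to float
val casted = result.select(result.columns.map { c =>
  if (c == "new_time" || c == "time") col(c) else col(c).cast("float")
}: _*)

// Average every casted column, grouped by the 5-minute bucket
val aggs = casted.columns
  .filter(c => c != "new_time" && c != "time")
  .map(c => avg(c))
val finalresult = casted.groupBy("new_time").agg(aggs.head, aggs.tail: _*).sort("new_time")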

      
