Add a new column to a DataFrame based on an existing column

I have a CSV file with a datetime column: "2011-05-02T04:52:09+00:00".

I am using Scala; the file is loaded into a Spark DataFrame, and I can use Joda-Time to parse the date:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
// Pattern letters are case-sensitive: MM = month, HH = hour-of-day (mm would be minutes);
// ZZ matches an offset written with a colon, e.g. +00:00.
val d = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZZ")

I would like to create a new column from the datetime field, holding the parsed time.

In a DataFrame, how do I create a new column from the value of another column?

I notice that DataFrame has a df.withColumn("dt", column) method. Is there a way to build that column from the value of an existing column?

Thanks.



1 answer


import java.sql.Date
import org.apache.spark.sql.types.DateType
import org.apache.spark.sql.functions._
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

// MM = month, HH = hour-of-day; ZZ matches an offset with a colon, e.g. +00:00.
val d = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZZ")
// DateType columns are backed by java.sql.Date, so build one from the parsed millis.
val dtFunc: (String => Date) = (arg1: String) => new Date(DateTime.parse(arg1, d).getMillis)
val x = df.withColumn("dt", callUDF(dtFunc, DateType, col("dt_string")))

callUDF and col are included in functions, as shown in the imports. dt_string inside col("dt_string") is the name of the source column of your df that you want to convert from.



Alternatively, you can replace the last statement with the following:

val dtFunc2 = udf(dtFunc)
val x = df.withColumn("dt", dtFunc2(col("dt_string")))
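Either way, the pattern string is the part most likely to bite: pattern letters are case-sensitive, so mm (minutes) in the original would silently mis-parse the month. A minimal stand-alone sketch of the same parse, using java.time from the JDK instead of Joda-Time (an assumption here, chosen so it runs without the Joda or Spark dependencies; XXX matches an offset like +00:00):

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

object DateParseDemo extends App {
  // MM = month, HH = hour-of-day, XXX = zone offset such as +00:00 or Z.
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
  val parsed = OffsetDateTime.parse("2011-05-02T04:52:09+00:00", fmt)
  println(parsed.toLocalDate) // prints 2011-05-02
}
```

The same yyyy-MM-dd'T'HH:mm:ss skeleton carries over to the Joda pattern in the answer above, with ZZ in place of XXX for the offset.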

