Specifying column types in sparklyr (spark_read_csv)
I am reading a CSV into Spark using sparklyr:
schema <- structType(structField("TransTime", "array<timestamp>", TRUE),
                     structField("TransDay", "Date", TRUE))
spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema = schema)
But I get:
Error: could not find function "structType"
How can I specify column types using spark_read_csv?
Thanks in advance.
+3
Levi brackman
2 answers
structType comes from Spark's Scala API (and from SparkR); it does not exist in sparklyr. In sparklyr you specify column types by passing a named list to the columns argument. Suppose we have the following CSV (data.csv):
name,birthdate,age,height
jader,1994-10-31,22,1.79
maria,1900-03-12,117,1.32
Here is how to read that data with the types specified:
mycsv <- spark_read_csv(sc, "mydate",
  path = "data.csv",
  memory = TRUE,
  infer_schema = FALSE,  # must be FALSE, otherwise the types below are ignored
  columns = list(
    name = "character",
    birthdate = "date",  # or "character"; see the note on Hive date functions below
    age = "integer",
    height = "double"))
# How the R type names map to storage types:
# integer   = "INTEGER"
# double    = "REAL"
# character = "STRING"
# logical   = "INTEGER"
# list      = "BLOB"
# date      = "STRING"  # not sure
To work with the date column you must use the Hive date functions, not the R date functions:
mycsv %>% mutate(birthyear = year(birthdate))
Link: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
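For example, a minimal sketch assuming the mycsv table from above (year, month, datediff, and current_date are Hive functions that sparklyr passes through to Spark SQL):

library(dplyr)

mycsv %>%
  mutate(
    birthyear = year(birthdate),                      # Hive year()
    birthmonth = month(birthdate),                    # Hive month()
    days_alive = datediff(current_date(), birthdate)  # days between two dates
  ) %>%
  collect()  # bring the result back into R as a tibble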
+5
Jader Martins
We have an example of how to do this in one of the articles on the official sparklyr site; here is the link: http://spark.rstudio.com/example-s3.html#data_import
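The pattern in that article is the same columns + infer_schema = FALSE approach shown above, applied to a file on S3. A minimal sketch, where the bucket, file, and column names are placeholders and S3 access is assumed to be configured on the cluster:

library(sparklyr)

sc <- spark_connect(master = "local")

flights <- spark_read_csv(sc, "flights",
  path = "s3a://example-bucket/flights.csv",  # placeholder S3 path
  memory = TRUE,
  infer_schema = FALSE,
  columns = list(
    year = "integer",
    carrier = "character",
    dep_delay = "double"))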
+2
edgararuiz