Sparklyr exception when reading csv file by outputting schema: double
I am trying to read csv in Spark using a function spark_read_csv
. I get an exception when dumping the schema, I get an exception when I install infer_schema=TRUE
.
spark_read_csv(sc,"myDf",DatasetUrl)
I am getting the following exception:
Error: org.apache.spark.SparkException: Work aborted due to phase failure: Task 0 in phase 90.0 failed 1 time, last failure: Lost task 0.0 in phase 90.0 (TID 151, localhost): java.text. ParseException: Unparseable number: "cr1_fd_dttm" in java.text.NumberFormat.parse (NumberFormat.java:385) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast $$ anonfun $ castToD $ 4.apply $ mcc sp (CSVInferSchema.scala: 259)
However, when I try to install infer_schema=FALSE
as expected, everything reads as chr
.
This is what the data in the column looks like cr1_fd_dttm
:
cr1_fd_dttm
<chr>
1 0.0
2 1.45679112E12
3 1.45679166E12
4 1.45679154E12
5 1.45679274E12
6 0.0
7 0.0
8 0.0
9 0.0
10 1.45679118E12
Can someone help me with this?
thank
source to share
I just read the file without immediately putting it into memory, forcibly increased the number of fields, and then loaded those results into memory. Thus, keys must be set memory
to FALSE, infer_schema
to FALSE, passed a list of columns, coerce, and then used compute()
to store the results in Spark memory. Here's a long but working example:
mapped_flights <- spark_read_csv(sc, "mapped_flights",
path = "s3a://flights-data/full",
memory = FALSE,
infer_schema = FALSE,
columns = list(
Year = "character",
Month = "character",
DayofMonth = "character",
DayOfWeek = "character",
DepTime = "character",
CRSDepTime = "character",
ArrTime = "character",
CRSArrTime = "character",
UniqueCarrier = "character",
FlightNum = "character",
TailNum = "character",
ActualElapsedTime = "character",
CRSElapsedTime = "character",
AirTime = "character",
ArrDelay = "character",
DepDelay = "character",
Origin = "character",
Dest = "character",
Distance = "character",
TaxiIn = "character",
TaxiOut = "character",
Cancelled = "character",
CancellationCode = "character",
Diverted = "character",
CarrierDelay = "character",
WeatherDelay = "character",
NASDelay = "character",
SecurityDelay = "character",
LateAircraftDelay = "character")
)
flights <- mapped_flights %>% mutate(
Year = as.integer(Year),
Month = as.integer(Month),
DayofMonth = as.integer(DayofMonth),
DayOfWeek = as.integer(DayOfWeek),
DepTime = as.integer(DepTime),
CRSDepTime = as.integer(CRSDepTime),
CRSArrTime = as.integer(CRSArrTime),
ArrTime = as.integer(ArrTime),
ActualElapsedTime = as.integer(ActualElapsedTime),
CRSElapsedTime = as.integer(CRSElapsedTime),
AirTime = as.integer(AirTime),
ArrDelay = as.double(ArrDelay),
DepDelay = as.double(DepDelay),
Distance = as.integer(Distance),
TaxiIn = as.integer(TaxiIn),
TaxiOut = as.integer(TaxiOut),
Cancelled = as.integer(Cancelled),
Diverted = as.integer(Diverted),
CarrierDelay = as.integer(CarrierDelay),
WeatherDelay = as.integer(WeatherDelay),
NASDelay = as.integer(NASDelay),
SecurityDelay = as.integer(SecurityDelay),
LateAircraftDelay = as.integer(LateAircraftDelay)) %>% compute("flights")
source to share