Error reading file in Spark
I'm having a hard time figuring out why Spark can't access the file I'm adding to the context. Below is my code in the REPL:
scala> sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")
scala> val featureFile = sc.textFile(SparkFiles.get("feature_matrix.json"))
featureFile: org.apache.spark.rdd.RDD[String] = /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json MappedRDD[1] at textFile at <console>:60
scala> featureFile.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: cfs://172.30.26.95/tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
The file does exist in /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
Any help is appreciated.
If you are using addFile, you need SparkFiles.get to retrieve the file. Also, addFile is lazy, so it is quite possible the file has not yet been placed where you are looking until you actually reference it, which can make it look like the file is missing.
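One detail worth noting in your transcript: sc.textFile resolves a bare path against the cluster's default filesystem (cfs:// in your case), while addFile distributes the file to local disk on each node, which is why the cfs:// lookup fails. A minimal sketch of reading the distributed copy locally instead (assuming a running REPL with sc defined; the RDD of one element is just a way to run the read on an executor):

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")

// SparkFiles.get returns a local filesystem path on whichever node calls it,
// so read it with plain java.io/scala.io rather than sc.textFile, which would
// resolve the path against the default (cfs://) filesystem.
val lines = sc.parallelize(Seq(1)).flatMap { _ =>
  val localPath = SparkFiles.get("feature_matrix.json")
  Source.fromFile(localPath).getLines().toList
}
lines.first()
```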
All that said, I don't know that reaching for SparkFiles as your first option is ever a smart idea. Use something like --files with spark-submit and the files will be placed in your working directory.
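For example, the --files approach might look like this (a hypothetical submit command; the class and jar names are placeholders for your project):

```shell
spark-submit \
  --class com.example.MyDemo \
  --files /home/ubuntu/my_demo/src/main/resources/feature_matrix.json \
  my_demo.jar
```

The file then sits in each executor's working directory, so inside the job a plain local read such as scala.io.Source.fromFile("feature_matrix.json") finds it without any SparkFiles lookup.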