Apache Spark behavior with regular files
I have a Spark application that reads data from a local file:
JavaRDD<String> file = context.textFile(input);
Since this is not a distributed file, every node needs its own copy of the input file at the same path. Does this mean that each node will process the entire file? If so, is there a way to avoid placing a copy on each node and force the nodes to process different parts of the file instead? Thanks!
1 answer
Assuming the file is small enough to fit in driver memory, it can be read with plain (non-Spark) code on the driver and then turned into an RDD with context.parallelize(data), which distributes the data across the cluster for further parallel processing.
In Scala, you can do something like this:
import scala.io.Source

// Read the local file on the driver, then let Spark split the lines across the cluster
val lines = Source.fromFile("/path/to/local/file").getLines()
val rdd = sparkContext.parallelize(lines.toSeq, numOfPartitions)
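Since the question uses the Java API, here is a rough Java equivalent. This is only a minimal sketch under the same assumption that the file fits in driver memory; the path and numOfPartitions are placeholders, and context is the JavaSparkContext from the question.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

// Read the whole file on the driver (assumes it fits in memory),
// then let Spark distribute the lines across the cluster.
List<String> lines = Files.readAllLines(Paths.get("/path/to/local/file"));
JavaRDD<String> rdd = context.parallelize(lines, numOfPartitions);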