Apache Spark behavior with regular files
I have a Spark application that reads data from a local file:
JavaRDD<String> file = context.textFile(input);
Since this is not a distributed file, every node needs its own copy of the input file at the same path. Does this mean that each node will process the entire file? If so, is there a way to avoid placing a copy on each node and force the nodes to process different parts of the file instead? Thanks!
1 answer
Assuming the file is small enough to fit in driver memory, it can be read with plain (non-Spark) code on the driver and then turned into an RDD with context.parallelize(data), which distributes the data across the cluster for further parallel processing.
In Scala, you can do something like this:
import scala.io.Source

// Read the local file on the driver, then let Spark split the lines across the cluster
val lines = Source.fromFile("/path/to/local/file").getLines()
val rdd = sparkContext.parallelize(lines.toSeq, numOfPartitions)
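Since the question uses the Java API, here is a rough Java equivalent. This is only a minimal sketch under the same assumption that the file fits in driver memory; the path and numOfPartitions are placeholders, and context is the JavaSparkContext from the question.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

// Read the whole file on the driver (assumes it fits in memory),
// then let Spark distribute the lines across the cluster.
List<String> lines = Files.readAllLines(Paths.get("/path/to/local/file"));
JavaRDD<String> rdd = context.parallelize(lines, numOfPartitions);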