Apache Spark behavior with regular files

I have a Spark application that reads data from a local file.

JavaRDD<String> file = context.textFile(input);

      

Since this is not a distributed file system, I need a copy of the input file at the same path on every node. Does this mean that each node will process the entire file? If so, is there a way to keep the file in one place and force the nodes to process different parts of it without a copy on each node? Thanks!

java scala hadoop apache-spark




1 answer


Assuming the file fits in driver memory, it can be loaded with plain (non-Spark) code and then turned into an RDD with context.parallelize(data), which distributes it across the cluster for further parallel processing.

In Scala, you can do something like this:



import scala.io.Source

// Read the file on the driver (assumes it fits in memory), then distribute it.
val lines = Source.fromFile("/path/to/local/file").getLines()
val rdd = sparkContext.parallelize(lines.toSeq, numOfPartitions)
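
For the Java API used in the question, a roughly equivalent sketch might look like the following. This assumes a JavaSparkContext named context, that the file fits in driver memory, and that the path and partition count are placeholders; the helper method itself is hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Read the whole file on the driver, then distribute the lines across the cluster.
JavaRDD<String> loadLocalFile(JavaSparkContext context, String path, int numPartitions)
        throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(path));
    return context.parallelize(lines, numPartitions);
}

Because the file is read only on the driver, the workers no longer need a copy at the same local path; the trade-off is that the whole file must fit in driver memory before it is parallelized.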

      









