Is it the driver or the workers that read a text file when using sc.textFile?

I am wondering how sc.textFile works in Spark. I assume that the driver reads part of the file at a time and distributes the text to the workers to process. Or do the workers read the text directly from the file, without driver intervention?



2 answers


The driver looks at the file's metadata: it checks that the path exists, lists the files if it is a directory, and records their sizes. It then schedules tasks on the workers, which actually read the contents of the files. A task is essentially: "read this file, starting at this offset, for this length."
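The division of labor described above can be sketched in plain Python. This is a toy illustration, not Spark code: `read_split` plays the role of a worker task, and the driver-side code touches only the file's metadata (its size), never its contents.

```python
import os
import tempfile

# Hypothetical sample file standing in for a distributed text file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"alpha\nbravo\ncharlie\ndelta\n")

def read_split(path, offset, length):
    """What a worker task does: open the file itself and read only
    the byte range [offset, offset + length) assigned by the driver."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# The "driver" inspects metadata (the size) and hands out ranges.
size = os.path.getsize(path)
first = read_split(path, 0, size // 2)
second = read_split(path, size // 2, size - size // 2)

# The workers' reads together cover the whole file.
assert first + second == b"alpha\nbravo\ncharlie\ndelta\n"
```

The key point is that the file bytes never pass through the driver; each task opens the file (or HDFS block) itself.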

HDFS splits large files into chunks, and Spark (usually / often) creates one task per chunk, so seeking to that offset is efficient.
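A rough sketch of how such split planning works, with hypothetical sizes (Spark's actual split logic is more involved, e.g. it handles records straddling block boundaries):

```python
def plan_splits(file_size, block_size):
    """One (offset, length) task per block; the last may be shorter."""
    splits = []
    offset = 0
    while offset < file_size:
        splits.append((offset, min(block_size, file_size - offset)))
        offset += block_size
    return splits

# e.g. a 300 MB file with 128 MB HDFS blocks yields three tasks:
print(plan_splits(300, 128))  # [(0, 128), (128, 128), (256, 44)]
```

Because split boundaries coincide with block boundaries, each task can usually read its range from a local replica of the block.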



Other file systems behave similarly, though not always. Compression can also interfere with this process if the codec is not splittable.
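A toy demonstration of why a non-splittable codec defeats the offset-based scheme, using Python's standard `gzip` module: a gzip stream can only be decoded from its beginning, so a worker handed "start at this offset" has no way to decompress its slice.

```python
import gzip
import zlib

# A gzip stream must be decoded from its start.
data = gzip.compress(b"line\n" * 10_000)

# Decompressing the whole stream works:
assert gzip.decompress(data).startswith(b"line\n")

# But decoding from an arbitrary mid-stream offset fails:
try:
    gzip.decompress(data[len(data) // 2:])
    splittable = True
except (OSError, zlib.error):
    splittable = False

# splittable is False: gzip is not splittable, so the whole .gz
# file must go to a single task.
```

This is why Spark reads a `.gz` file with a single task, regardless of how many HDFS blocks it spans.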



textFile creates an RDD, as the documentation states:

Text file RDDs can be created using SparkContext's textFile method.

There is also this note:



If using a path on the local filesystem, the file must also be accessible at the same path on the worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

which implies that your assumption that the driver parses the file and then distributes the data to the workers is incorrect.







