How can I improve the performance of TextIO or AvroIO when reading a very large number of files?

TextIO.read() and AvroIO.read() (as well as some other Beam IO transforms) by default do not perform well in current Apache Beam runners when reading a filepattern that expands into a very large number of files (for example, 1M files).

How can I efficiently read such a large number of files?



1 answer


When you know in advance that the filepattern being read with TextIO or AvroIO will expand into a large number of files, you can use the recently added hint .withHintMatchesManyFiles(), which is currently implemented for TextIO and AvroIO.

For example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    // Hint that this filepattern matches a very large number of files.
    .withHintMatchesManyFiles());

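The same hint is available on AvroIO. A minimal sketch, where MyRecord is a hypothetical Avro-generated class and the bucket path is illustrative:

// MyRecord is a hypothetical Avro-generated class; the path is illustrative.
PCollection<MyRecord> records = p.apply(AvroIO.read(MyRecord.class)
    .from("gs://some-bucket/many/avro-files/*")
    .withHintMatchesManyFiles());
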
Using this hint causes the transforms to execute in a way that is optimized for reading a large number of files: the number of files that can be read this way is practically unlimited, and the pipeline will most likely also run faster, cheaper, and more reliably than it would without the hint.

However, it may perform worse than without the hint if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred).

Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll(), respectively. These are more flexible and scalable versions of read() that read a PCollection<String> of filepatterns (where each String is a filepattern), with the same caveat: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
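For example, when the filepatterns themselves only become available at runtime, readAll() can be used directly. A minimal sketch, assuming the usual Beam imports (org.apache.beam.sdk.transforms.Create in particular); the bucket paths are illustrative:

// Build a PCollection of filepatterns (the paths here are illustrative).
PCollection<String> filepatterns = p.apply(Create.of(
    "gs://some-bucket/many/files/*",
    "gs://other-bucket/more/files/*"));

// Read the lines of every file matching any of the filepatterns.
PCollection<String> lines = filepatterns.apply(TextIO.readAll());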
