TextIO.Write - add or replace output files (Google Cloud Dataflow)

I can't find any documentation on this, so I'm wondering: what is the behavior if the output files already exist (in the gs:// bucket)?

Thanks, G



1 answer


The files will be overwritten. There are several reasons for this:

  • The "report" use case (computing a summary of the input data and writing the results to GCS) seems much more common than the use case where you produce data incrementally and add more of it to GCS with each pipeline execution.
  • It is good if rerunning a pipeline is idempotent (-ish?). For example, if you find a bug in your pipeline, you can simply fix it, rerun it, and get correct results written over the old ones. A pipeline that appends to files would be very difficult to work with in this respect.
  • You are not required to specify the number of output shards for TextIO.Write; it may differ slightly between executions, even for exactly the same pipeline and the same input. The semantics of appending would be very confusing in this case.
  • Appending, as far as I know, cannot be implemented efficiently on any filesystem I am aware of while preserving atomicity and fault-tolerance guarantees (for example, that you produce all of the output or none of it, even when bundles are re-executed due to failures).
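
To make the idempotency point concrete, here is a minimal local-filesystem sketch in Python. It is only an analogy and not how the Dataflow SDK or GCS implements TextIO.Write: the helper names `write_shards` and `read_all`, the round-robin sharding, and the `prefix-00000-of-00002.txt` file naming are all assumptions made for illustration. Each shard is written to a temporary file and then moved over the final name, so running the same "pipeline" twice leaves the same files with the same contents instead of doubling them.

```python
import os
import tempfile


def write_shards(out_dir, prefix, records, num_shards):
    """Write records into num_shards files, replacing existing shard files.

    Hypothetical helper for illustration only. Each shard goes to a temporary
    file first and is then moved into place with os.replace, which overwrites
    an existing destination instead of appending to it.
    """
    # Distribute records round-robin across shards (an arbitrary choice).
    shards = [[] for _ in range(num_shards)]
    for i, record in enumerate(records):
        shards[i % num_shards].append(record)

    paths = []
    for index, lines in enumerate(shards):
        final_path = os.path.join(
            out_dir, f"{prefix}-{index:05d}-of-{num_shards:05d}.txt")
        fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(line + "\n" for line in lines)
        os.replace(tmp_path, final_path)  # overwrite, never append
        paths.append(final_path)
    return paths


def read_all(paths):
    """Return every line from the given files as one sorted list."""
    out = []
    for path in paths:
        with open(path) as f:
            out.extend(line.rstrip("\n") for line in f)
    return sorted(out)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        records = ["a", "b", "c", "d"]
        write_shards(out_dir, "report", records, 2)
        rerun = write_shards(out_dir, "report", records, 2)  # the "rerun"
        print(read_all(rerun))  # ['a', 'b', 'c', 'd'], not duplicated
```

With append semantics the second run would leave every record in the output twice, and a rerun after a bug fix would mix old and new results in the same files.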


This behavior will be documented in the next version of the SDK that appears on GitHub.
