Writing to Cloud Storage as a side effect in a Cloud Dataflow pipeline

I have a Cloud Dataflow app that does a bunch of processing for an App Engine app. At one stage in the pipeline, I do a GroupBy on a particular key, and for each record that matches that key I would like to write a file to Cloud Storage (using that key as part of the file name).

I don't know in advance how many of these records there will be, so this usage pattern doesn't fit the standard Cloud Dataflow sink pattern (where the sharding of the output stage determines the number of output files, and I have no control over the output file names per shard).
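
For concreteness, the stage I'm talking about looks roughly like this minimal sketch (Beam-style Java; the data, bucket, and DoFn body are made-up placeholders, and the real records come from the earlier processing stages):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class PerKeyFilesPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder input; the real records come from earlier pipeline stages.
    p.apply(Create.of(
            KV.of("key-a", "record 1"),
            KV.of("key-a", "record 2"),
            KV.of("key-b", "record 3")))
        .apply(GroupByKey.<String, String>create())
        .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // This is where I'd like to write gs://<some-bucket>/<key> as a
            // side effect, one file per key, instead of using a standard sink.
          }
        }));

    p.run();
  }
}
```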

I am considering writing to Cloud Storage directly as a side effect in my ParDo function, but have the following questions:

  • Is writing to Cloud Storage as a side effect allowed at all?
  • If I were writing to Cloud Storage from outside a Dataflow pipeline, it seems I should use the Java client library for the JSON Cloud Storage API. But that requires authenticating via OAuth to do any work, which seems out of place for code already running on GCE machines as part of a Dataflow pipeline: will this work?

Any advice is greatly appreciated.


2 answers


Answering the first part of your question:

While nothing directly stops you from performing side effects (such as writing to Cloud Storage) in your pipeline code, it is usually a very bad idea. You should keep in mind that your code is not running single-threaded on a single machine. You would need to solve several problems:



  • Multiple writers could be writing at the same time, so you need a way to avoid conflicts between them. Since Cloud Storage does not support appending directly to an object, you might want to use the Composite objects techniques.
  • Workers can be aborted, e.g. in the case of transient failures or infrastructure problems, which means you need to be able to handle interrupted / incomplete writes.
  • Workers can be restarted (after they were aborted), which causes the side-effect code to run again. So you need to be able to handle duplicate writes in your output one way or another (see the sketch after this list).
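
For example, a minimal sketch of the "handle duplicates" point: derive the object name from the key alone, so that a re-executed bundle simply overwrites the same object instead of creating a new one. This assumes the google-cloud-storage Java client library and a Beam-style DoFn; the bucket name and the KeyFileWriterFn class are made up for illustration.

```java
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical DoFn that writes one GCS object per key as a side effect.
// The object name depends only on the key, so a retried bundle overwrites
// the same object rather than leaving duplicates behind.
class KeyFileWriterFn extends DoFn<KV<String, Iterable<String>>, Void> {
  private static final String BUCKET = "my-output-bucket";  // made-up bucket name

  private transient Storage storage;

  @Setup
  public void setup() {
    // Uses Application Default Credentials; on Dataflow workers this resolves
    // to the worker service account, so no interactive OAuth flow is needed.
    storage = StorageOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, Iterable<String>> kv = c.element();
    StringBuilder contents = new StringBuilder();
    for (String record : kv.getValue()) {
      contents.append(record).append('\n');
    }
    // Deterministic name derived from the key: idempotent across retries.
    String objectName = "per-key-output/" + kv.getKey() + ".txt";
    BlobInfo blob = BlobInfo.newBuilder(BUCKET, objectName).build();
    storage.create(blob, contents.toString().getBytes(StandardCharsets.UTF_8));
  }
}
```

Interrupted writes are also less of a problem with this pattern, since a Cloud Storage object upload either completes and becomes visible as a whole or does not appear at all.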

There is nothing in Dataflow that prevents you from writing to a GCS file in your ParDo.



You can use GcpOptions.getCredential() to obtain the credentials used for authentication. This will use the appropriate mechanism to obtain credentials depending on how the job is running. For example, it will use the service account when the job runs on the Dataflow service.
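
For example, a minimal sketch of pulling those credentials out of the pipeline options and handing them to a storage client (this assumes the Beam SDK, where the accessor on GcpOptions is named getGcpCredential(), and the google-cloud-storage client library; the StorageClients helper is made up):

```java
import com.google.auth.Credentials;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.options.PipelineOptions;

// Made-up helper: builds a GCS client from the pipeline's own credentials.
// Inside a DoFn you can obtain the options via c.getPipelineOptions().
final class StorageClients {
  static Storage fromPipelineOptions(PipelineOptions options) {
    Credentials credentials = options.as(GcpOptions.class).getGcpCredential();
    return StorageOptions.newBuilder()
        .setCredentials(credentials)
        .build()
        .getService();
  }
}
```

Because the credential plumbing goes through the pipeline options, the same code works locally (user credentials) and on the Dataflow service (worker service account) without any interactive OAuth step.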
