Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn

Question

Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn

I am trying to write Google PubSub posts to Google Cloud Storage using Google Cloud Dataflow. I know TextIO / AvroIO doesn't support streaming pipelines. However, I read in [1] that the entry to GCS in the streaming pipeline is from the ParDo/DoFn

author's comment. I built the pipeline by following their article as closely as I could.

I was aiming for this behavior:

Messages written in batches of up to 100 objects to GCS (one per window) along a path that corresponds to the time the message was posted to dataflow-requests/[isodate-time]/[paneIndex]

.

I am getting different results:

Each hourly window has only one window. So I only get one file per watch bucket (this is indeed the path of the object in the GCS). Decreasing MAX_EVENTS_IN_FILE to 10 didn't matter, but only one panel / file.
There is only one message in each GCS object.
The pipeline sometimes throws a CRC error while writing to the GCS.

How do I fix these problems and get the behavior I expect?

Sample log output:

21:30:06.977 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:06.977 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.773 sucessfully write pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.846 sucessfully write pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.847 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0

Here is my code:

package com.example.dataflow;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.*;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.gcloud.storage.BlobId;
import com.google.gcloud.storage.BlobInfo;
import com.google.gcloud.storage.Storage;
import com.google.gcloud.storage.StorageOptions;
import org.joda.time.Duration;
import org.joda.time.format.ISODateTimeFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class PubSubGcsSSCCEPipepline {

    private static final Logger LOG = LoggerFactory.getLogger(PubSubGcsSSCCEPipepline.class);

    public static final String BUCKET_PATH = "dataflow-requests";

    public static final String BUCKET_NAME = "myBucketName";

    public static final Duration ONE_DAY = Duration.standardDays(1);
    public static final Duration ONE_HOUR = Duration.standardHours(1);
    public static final Duration TEN_SECONDS = Duration.standardSeconds(10);

    public static final int MAX_EVENTS_IN_FILE = 100;

    public static final String PUBSUB_SUBSCRIPTION = "projects/myProjectId/subscriptions/requests-dataflow";

    private static class DoGCSWrite extends DoFn<String, Void>
        implements DoFn.RequiresWindowAccess {

        public transient Storage storage;

        { init(); }

        public void init() { storage = StorageOptions.defaultInstance().service(); }

        private void readObject(java.io.ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            init();
        }

        @Override
        public void processElement(ProcessContext c) throws Exception {
            String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
            String blobName = String.format("%s/%s/%s", BUCKET_PATH, isoDate, c.pane().getIndex());

            BlobId blobId = BlobId.of(BUCKET_NAME, blobName);
            LOG.info("writing pane {} to blob {}", c.pane().getIndex(), blobName);
            storage.create(BlobInfo.builder(blobId).contentType("text/plain").build(), c.element().getBytes());
            LOG.info("sucessfully write pane {} to blob {}", c.pane().getIndex(), blobName);
        }
    }

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        options.as(DataflowPipelineOptions.class).setStreaming(true);
        Pipeline p = Pipeline.create(options);

        PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
                .subscription(PUBSUB_SUBSCRIPTION);

        PCollection<String> streamData = p.apply(readFromPubsub);

        PCollection<String> windows = streamData.apply(Window.<String>into(FixedWindows.of(ONE_HOUR))
                .withAllowedLateness(ONE_DAY)
                .triggering(AfterWatermark.pastEndOfWindow()
                        .withEarlyFirings(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE))
                        .withLateFirings(AfterFirst.of(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE),
                                AfterProcessingTime.pastFirstElementInPane()
                                        .plusDelayOf(TEN_SECONDS))))
                .discardingFiredPanes());

        windows.apply(ParDo.of(new DoGCSWrite()));

        p.run();
    }


}

[1] https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/

Thanks to Sam McVeety for the solution. Here is the corrected code for those reading:

package com.example.dataflow;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.*;
import com.google.cloud.dataflow.sdk.transforms.windowing.*;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.gcloud.WriteChannel;
import com.google.gcloud.storage.BlobId;
import com.google.gcloud.storage.BlobInfo;
import com.google.gcloud.storage.Storage;
import com.google.gcloud.storage.StorageOptions;
import org.joda.time.Duration;
import org.joda.time.format.ISODateTimeFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Iterator;

public class PubSubGcsSSCCEPipepline {

    private static final Logger LOG = LoggerFactory.getLogger(PubSubGcsSSCCEPipepline.class);

    public static final String BUCKET_PATH = "dataflow-requests";

    public static final String BUCKET_NAME = "myBucketName";

    public static final Duration ONE_DAY = Duration.standardDays(1);
    public static final Duration ONE_HOUR = Duration.standardHours(1);
    public static final Duration TEN_SECONDS = Duration.standardSeconds(10);

    public static final int MAX_EVENTS_IN_FILE = 100;

    public static final String PUBSUB_SUBSCRIPTION = "projects/myProjectId/subscriptions/requests-dataflow";

    private static class DoGCSWrite extends DoFn<Iterable<String>, Void>
        implements DoFn.RequiresWindowAccess {

        public transient Storage storage;

        { init(); }

        public void init() { storage = StorageOptions.defaultInstance().service(); }

        private void readObject(java.io.ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            init();
        }

        @Override
        public void processElement(ProcessContext c) throws Exception {
            String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
            long paneIndex = c.pane().getIndex();
            String blobName = String.format("%s/%s/%s", BUCKET_PATH, isoDate, paneIndex);

            BlobId blobId = BlobId.of(BUCKET_NAME, blobName);

            LOG.info("writing pane {} to blob {}", paneIndex, blobName);
            WriteChannel writer = storage.writer(BlobInfo.builder(blobId).contentType("text/plain").build());
            LOG.info("blob stream opened for pane {} to blob {} ", paneIndex, blobName);
            int i=0;
            for (Iterator<String> it = c.element().iterator(); it.hasNext();) {
                i++;
                writer.write(ByteBuffer.wrap(it.next().getBytes()));
                LOG.info("wrote {} elements to blob {}", i, blobName);
            }
            writer.close();
            LOG.info("sucessfully write pane {} to blob {}", paneIndex, blobName);
        }
    }

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        options.as(DataflowPipelineOptions.class).setStreaming(true);
        Pipeline p = Pipeline.create(options);

        PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
                .subscription(PUBSUB_SUBSCRIPTION);

        PCollection<String> streamData = p.apply(readFromPubsub);
        PCollection<KV<String, String>> keyedStream =
                streamData.apply(WithKeys.of(new SerializableFunction<String, String>() {
                    public String apply(String s) { return "constant"; } }));

        PCollection<KV<String, Iterable<String>>> keyedWindows = keyedStream
                .apply(Window.<KV<String, String>>into(FixedWindows.of(ONE_HOUR))
                        .withAllowedLateness(ONE_DAY)
                        .triggering(AfterWatermark.pastEndOfWindow()
                                .withEarlyFirings(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE))
                                .withLateFirings(AfterFirst.of(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE),
                                        AfterProcessingTime.pastFirstElementInPane()
                                                .plusDelayOf(TEN_SECONDS))))
                        .discardingFiredPanes())
                .apply(GroupByKey.create());


        PCollection<Iterable<String>> windows = keyedWindows
                .apply(Values.<Iterable<String>>create());


        windows.apply(ParDo.of(new DoGCSWrite()));

        p.run();
    }

}

+13

google-cloud-storage google-cloud-dataflow google-cloud-pubsub

Frank wilson 08 Apr 16 at 20:48

source to share

2 answers

Starting with 2.0 We do Beam, TextIO

/ AvroIO

maintain a record of unlimited collections - see. Documentation , in particular, you need to specify withWindowedWrites()

.

+2

jkff 16 jul. 17 at 14:03

source to share

Sam McVeety · Accepted Answer · 2016-04-09T00:00:21+0000

Here's what you need GroupByKey

to know for the panels to be aggregated. The Spotify example refers to this as "Materializing panels is done in a Cumulative Events transformation, which is nothing more than a GroupByKey transformation," but that's a subtle point. You will need to provide a key to do this and in your case a constant value will be displayed.

  PCollection<String> streamData = p.apply(readFromPubsub);
  PCollection<KV<String, String>> keyedStream =
        streamData.apply(WithKeys.of(new SerializableFunction<String, String>() {
           public Integer apply(String s) { return "constant"; } }));

At this point you can apply your windowing function and then final GroupByKey

to get the desired behavior:

  PCollection<String, Iterable<String>> keyedWindows = keyedStream.apply(...)
       .apply(GroupByKey.create());
  PCollection<Iterable<String>> windows = keyedWindows
       .apply(Values.<Iterable<String>>create());

Items in processElement

will now be Iterable<String>

100 or more.

We created https://issues.apache.org/jira/browse/BEAM-184 to make this procedure clearer.

Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn

More articles: