Pipeline reads 8 billion lines from GCS and does a GroupByKey to prevent fusion; the GroupByKey step is very slow

I read 8 billion lines from GCS, process each line, and then write the output. My processing step can take some time, so to keep the worker's lease from expiring and hitting the error below, I do a GroupByKey on an ID to prevent fusion.
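
The post doesn't include the pipeline code, but a minimal Apache Beam Python sketch of the pattern described above (read from GCS, process each line, then GroupByKey on an ID as a fusion break) could look roughly like this; the bucket paths, input format, and the body of process_line are hypothetical:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def process_line(line):
        # Hypothetical per-line processing: split out an ID and transform the payload.
        # The real processing step is whatever slow work the question describes.
        record_id, payload = line.split(",", 1)
        return record_id, payload.upper()

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/input/*")  # hypothetical path
         | "Process" >> beam.Map(process_line)
         | "GroupById" >> beam.GroupByKey()  # materializes the shuffle and breaks fusion
         | "FormatOutput" >> beam.MapTuple(lambda k, values: f"{k},{len(list(values))}")
         | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))  # hypothetical path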

A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:

The problem is that the GroupByKey step never completes for 8 billion rows, even on 1000 high-memory nodes.

One possible reason I considered for the slow processing is a large number of values accumulating per key in the GroupByKey. I don't think that is the case, because out of the 8 billion inputs, a single input ID cannot appear more than 30 times in the set. So the hot-key problem does not apply here; something else is going on.
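
If it helps to double-check that assumption, a small diagnostic branch can compute the largest per-key fan-outs; this is a sketch with illustrative names, not part of the original pipeline:

    import apache_beam as beam

    def largest_key_fanouts(keyed_pcoll, n=10):
        # Diagnostic sketch: count how many values each key receives, then
        # keep the n largest counts. If the top counts are all <= 30, the
        # hot-key hypothesis can be ruled out.
        return (keyed_pcoll
                | "OnePerElement" >> beam.Map(lambda kv: (kv[0], 1))
                | "CountPerKey" >> beam.CombinePerKey(sum)
                | "SwapToCountFirst" >> beam.Map(lambda kv: (kv[1], kv[0]))
                | "TopN" >> beam.combiners.Top.Of(n))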

Any ideas on how to optimize this are appreciated. Thanks.

1 answer


I managed to solve this problem. I had several incorrect assumptions about Dataflow wall times. I looked at the step in my pipeline with the highest wall time, which was several days, and assumed it was the bottleneck. But in Apache Beam a step is usually fused with the steps downstream of it and will only run as fast as the downstream step in the pipeline. So a large wall time by itself is not enough to conclude that a step is the bottleneck of the pipeline.

The real solution to the problem described above came from this thread... I reduced the number of nodes my pipeline runs on and changed the machine type from highmem-2 to highmem-4. I wish there were an easy way to get memory usage metrics for a Dataflow pipeline; I had to SSH into the VMs and run jmap.
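
For reference, a sketch of how that fix might be expressed as Dataflow pipeline options in the Python SDK; the project, region, bucket, and worker count below are placeholders, and flag names can vary slightly between SDK versions:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Fewer workers, each with more memory per vCPU, with autoscaling pinned
    # so the worker count stays where it is set.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # placeholder
        region="us-central1",                # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder
        machine_type="n1-highmem-4",         # changed from n1-highmem-2
        num_workers=250,                     # reduced from 1000 (illustrative number)
        autoscaling_algorithm="NONE",
    )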


