Spark Streaming + Kafka throughput

In my Spark Streaming application I read from a Kafka topic. The topic has 10 partitions, so I created 10 receivers with one stream per receiver (a rough sketch of this setup follows the rate table below). With this configuration I observe strange behavior of the receivers. Their median rates are:

Receiver     Node     Median rate
Receiver-0   node-1   10K
Receiver-1   node-2   2.5K
Receiver-2   node-3   2.5K
Receiver-3   node-4   2.5K
Receiver-4   node-5   2.5K
Receiver-5   node-1   10K
Receiver-6   node-2   2.6K
Receiver-7   node-3   2.5K
Receiver-8   node-4   2.5K
Receiver-9   node-5   2.5K
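
For context, here is a minimal sketch of the setup described above, assuming the receiver-based Kafka API (KafkaUtils.createStream from spark-streaming-kafka 0.8) with ten streams unioned into one; the ZooKeeper quorum, topic name, and consumer group are placeholders, not values from the question.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TenReceiversSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-ten-receivers")
    val ssc  = new StreamingContext(conf, Seconds(30))    // 30-second batches, as in the question

    val zkQuorum = "zk-1:2181,zk-2:2181"                   // placeholder ZooKeeper quorum
    val groupId  = "my-consumer-group"                     // placeholder consumer group
    val topicMap = Map("my-topic" -> 1)                    // placeholder topic, 1 thread per receiver

    // Ten receiver-based streams, one per topic partition; Spark spreads the
    // receivers over the executors (five nodes in the question).
    val streams = (0 until 10).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    }
    val unioned = ssc.union(streams)

    unioned.map(_._2).count().print()                      // stand-in for the real processing

    ssc.start()
    ssc.awaitTermination()
  }
}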


Issue 1: node-1 receives as many messages as the other four nodes together.

Issue 2: the application does not come close to the batch processing limit (30-second batches are processed in an average of 17 seconds). I would like it to consume enough messages to bring processing time up to at least 25 seconds.

Where should I look for the bottleneck?

To be clear: there are plenty more messages available to consume.

Edit: it turned out that only two partitions had lag, so the first issue is resolved. Still, reading 10K msgs per second is not very much.





1 answer


Use Spark's built-in backpressure (available since Spark 1.5, which wasn't out at the time of your question): https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-streaming-backpressure.adoc

Just set:



spark.streaming.backpressure.enabled=true
spark.streaming.kafka.maxRatePerPartition=X   (set X really high in your case)
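
A minimal sketch of where those settings go, assuming the direct Kafka stream API (KafkaUtils.createDirectStream from spark-streaming-kafka 0.8); the broker list, topic name, and the 100000 cap are placeholders rather than recommended values.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object BackpressureSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-backpressure")
      .set("spark.streaming.backpressure.enabled", "true")
      // Upper bound per partition per second; backpressure adjusts the actual rate below it.
      .set("spark.streaming.kafka.maxRatePerPartition", "100000")

    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map("metadata.broker.list" -> "broker-1:9092")  // placeholder broker list
    val topics      = Set("my-topic")                                 // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()   // stand-in for the real processing

    ssc.start()
    ssc.awaitTermination()
  }
}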


To find the bottleneck, use the Spark Streaming web UI and look at which stages of the processing DAG take the most time.









