Spark Streaming + Kafka throughput
In my Spark Streaming application I read from a Kafka topic. The topic has 10 partitions, so I created 10 receivers with one stream per receiver (a sketch of this setup follows the list below). With this configuration I observe strange behaviour from the receivers. The median rates for these consumers are:
Receiver     Node     Median rate (msgs/s)
Receiver-0   node-1   10K
Receiver-1   node-2   2.5K
Receiver-2   node-3   2.5K
Receiver-3   node-4   2.5K
Receiver-4   node-5   2.5K
Receiver-5   node-1   10K
Receiver-6   node-2   2.6K
Receiver-7   node-3   2.5K
Receiver-8   node-4   2.5K
Receiver-9   node-5   2.5K
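A minimal sketch of a setup like this, assuming the receiver-based KafkaUtils.createStream API from spark-streaming-kafka and placeholder ZooKeeper, group, and topic names (none of which appear in the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-receivers")
val ssc = new StreamingContext(conf, Seconds(30)) // 30-second batches, as in the question

// One receiver per Kafka partition; each createStream call spawns one receiver.
val numReceivers = 10
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
}

// Union the ten streams so downstream processing sees a single DStream.
val unified = ssc.union(streams)
unified.count().print()

ssc.start()
ssc.awaitTermination()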
Issue 1: node-1 receives as many messages as the other four nodes combined.
Issue 2: the application does not reach its batch-processing capacity (30-second batches are processed in 17 seconds on average). I would like it to consume enough messages to push the processing time to at least 25 seconds per batch.
Where should I look for the bottleneck?
To be clear, there are more messages available to consume.
@Edit: it turned out that only two partitions had lag, so the first issue is resolved. However, reading 10K msgs per second is still not much.
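For reference, one way to check per-partition lag like this programmatically; this is a hedged sketch assuming a Kafka 0.10+ client and placeholder broker, group, and topic names (Kafka versions of that era shipped different offset-checking tools):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object LagCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // placeholder
    props.put("group.id", "my-group")             // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    try {
      val partitions = consumer.partitionsFor("my-topic").asScala
        .map(p => new TopicPartition(p.topic, p.partition))
      val ends = consumer.endOffsets(partitions.asJava).asScala
      partitions.foreach { tp =>
        // Lag = latest offset on the broker minus the group's committed offset.
        val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
        println(s"$tp lag=${ends(tp).longValue() - committed}")
      }
    } finally consumer.close()
  }
}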
Use Spark's built-in backpressure (available since Spark 1.5, so not yet released at the time of your question): https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-streaming-backpressure.adoc
Just set

spark.streaming.backpressure.enabled=true
spark.streaming.kafka.maxRatePerPartition=X (really high in your case)
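As a minimal sketch of applying these settings in code (the rate value 100000 is an illustrative placeholder; note that spark.streaming.kafka.maxRatePerPartition caps the direct, receiver-less Kafka API, while a receiver-based setup like the one in the question is capped by spark.streaming.receiver.maxRate):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-streaming")
  // Let Spark adapt the ingestion rate to the observed processing speed.
  .set("spark.streaming.backpressure.enabled", "true")
  // Upper bound per Kafka partition for the direct API; set it high so that
  // backpressure, not this cap, governs the steady-state rate.
  .set("spark.streaming.kafka.maxRatePerPartition", "100000")
  // Equivalent upper bound per receiver for receiver-based streams.
  .set("spark.streaming.receiver.maxRate", "100000")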
To find the bottleneck, use the Spark Streaming web UI and look at where the job's DAG spends most of its processing time ...