Spark Structured Streaming does not resume from Kafka offsets on restart

We have a long-running Spark Structured Streaming query that reads from Kafka, and we would like this query to pick up where it left off after a restart. However, we set startingOffsets to "earliest", and what we see after restarting is that the query reads the Kafka topic again from the beginning.

Our query looks like this:

  val extract = sparkSession
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "server:port")
    .option("subscribe", "topic")
    .option("startingOffsets", "earliest")
    .load()

  val query: StreamingQuery = extract
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoint/kafka/")
    .foreach(writer)
    .start()

We can see that the checkpoint directory is created correctly and that the offset files contain the offsets we expect.

On restart, we see a message like:

25-07-2017 14:35:32 INFO  ConsumerCoordinator:231 - Setting newly assigned partitions [KafkaTopic-2, KafkaTopic-1, KafkaTopic-0, KafkaTopic-3] for group spark-kafka-source-dedc01fb-c0a7-40ea-8358-a5081b961968--1396947302-driver

We do start the query with "earliest", but the documentation says:

This only applies when a new streaming query is started, and resuming will always pick up from where the query left off.

Shouldn't this mean that restarting our application causes the query to resume where it left off?
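If we read the documentation correctly, the expected behavior could be modeled like this (a simplified sketch for illustration only, an assumption on our part — this is not Spark's actual implementation): startingOffsets should be consulted only when no checkpoint exists, and on restart the checkpointed offsets should win.

```scala
// Simplified model of the documented restart behavior (an assumption for
// illustration -- NOT Spark's actual code): startingOffsets is consulted
// only for a brand-new query; on restart, checkpointed offsets win.
object StartResolution {
  sealed trait Start
  case object Earliest extends Start
  case object Latest extends Start
  final case class FromCheckpoint(offsets: Map[Int, Long]) extends Start

  def resolve(checkpointOffsets: Option[Map[Int, Long]],
              startingOffsets: String): Start =
    checkpointOffsets match {
      case Some(offs)                          => FromCheckpoint(offs) // restart
      case None if startingOffsets == "latest" => Latest               // new query
      case None                                => Earliest             // new query
    }
}
```

Under this model, a restart with an intact checkpoint should never re-read from the beginning of the topic, regardless of the "earliest" setting — which is what makes the behavior we observe so confusing.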

Installing " group.id

" for Kafka is not allowed with Spark Structured Streaming. See So: Note that the following Kafka parameters cannot be set and the Kafka source is throwing an exception.

I tried adding queryName, in case it is used to identify the query across restarts, but it had no effect.

We are using Spark 2.1 on YARN.

Any ideas on why this isn't working or what we're doing wrong?

UPDATE FROM LOGS:

From driver

From worker
