Is Structured Streaming real-time?
We know that Flink is a truly real-time stream processing engine that processes each record as soon as it arrives, and we also know that Spark Streaming is a micro-batch stream processing engine.
However, Spark has since released Structured Streaming. What about that? Is it a real streaming processor like Flink that handles each record as soon as it arrives, or does it still use micro-batch mode?
Is Structured Streaming a real-time stream processing engine?
TL;DR No. Or yes. It depends on your definition of "real-time stream processing engine".
Up to and including 2.3.0-SNAPSHOT (current master), Structured Streaming uses micro-batches, and nothing seems to suggest it will be different in future releases.
A deep dive into the Structured Streaming engine
StreamExecution (the runtime for a streaming query) starts a separate thread of execution that checks for new records.
Once started, microBatchThread (which is a regular java.lang.Thread) executes runBatches, which runs a batch at every trigger interval.
Walking through the code, you can see that the internal execution engine runs the streaming query once per trigger.
As far as I can tell, nothing has changed in terms of micro-batching. It was used in Spark Streaming and it is used in Structured Streaming as well.
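To make the shape of that loop concrete, here is a minimal toy sketch in plain Python. It is not Spark's code: `ToyMicroBatchExecution`, `_run_batches`, `_drain`, `wait_until`, and `demo` are hypothetical names chosen for illustration, and the queue stands in for a streaming source. It only shows the idea of a dedicated thread that wakes up once per trigger interval, collects whatever has arrived, and processes it as one batch.

```python
import queue
import threading
import time


class ToyMicroBatchExecution:
    """Toy stand-in for a micro-batch loop (hypothetical, not Spark's implementation)."""

    def __init__(self, source, trigger_interval_s, process_batch):
        self.source = source                      # queue.Queue acting as the stream source
        self.trigger_interval_s = trigger_interval_s
        self.process_batch = process_batch        # callback invoked once per non-empty batch
        self._stopped = threading.Event()
        self._thread = threading.Thread(
            target=self._run_batches, name="micro-batch-thread", daemon=True
        )

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped.set()
        self._thread.join()

    def _drain(self):
        # Collect every record that is available right now, without blocking.
        records = []
        while True:
            try:
                records.append(self.source.get_nowait())
            except queue.Empty:
                return records

    def _run_batches(self):
        # One iteration per trigger: check for new records, process them together.
        while not self._stopped.is_set():
            batch = self._drain()
            if batch:
                self.process_batch(batch)
            # Sleep until the next trigger (or until stop() is called).
            self._stopped.wait(self.trigger_interval_s)


def wait_until(predicate, timeout_s=5.0):
    # Small polling helper so the demo does not depend on exact timing.
    deadline = time.monotonic() + timeout_s
    while not predicate() and time.monotonic() < deadline:
        time.sleep(0.01)


def demo():
    source = queue.Queue()
    batches = []
    for record in ["a", "b", "c"]:
        source.put(record)                        # already waiting before the first trigger
    execution = ToyMicroBatchExecution(source, 0.05, batches.append)
    execution.start()
    wait_until(lambda: len(batches) == 1)
    source.put("d")                               # arrives between triggers
    wait_until(lambda: len(batches) == 2)
    execution.stop()
    return batches


if __name__ == "__main__":
    print(demo())
```

In this sketch the three records that were already waiting end up in one batch, and the record that arrives later is only picked up at the next trigger, which is the behaviour that distinguishes micro-batching from record-at-a-time processing.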
Shameless plug: you can explore the topic in more detail by reading my gitbook on Structured Streaming, which I am writing for this very purpose: to understand the lowest-level details. Comments are welcome.
At the last Spark Summit (SF, June 2017) they talked about the Continuous Pipeline, a new execution model without micro-batches, with latency <1ms (instead of the 10-100ms that is possible today); see slide 35 and SPARK-20928.
But the target version is 2.3.0.
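As a rough illustration of why micro-batching puts a floor on latency, here is a toy model in Python. It is only a sketch under loud assumptions: triggers fire at exact multiples of the interval, processing time is ignored, and the 100 ms interval and the arrival times are made-up values, not measurements. `micro_batch_wait_ms` is a hypothetical helper, not a Spark API.

```python
def micro_batch_wait_ms(arrival_ms, trigger_interval_ms):
    """Time a record waits until the next trigger boundary picks it up (toy model)."""
    # Round the arrival time up to the next multiple of the trigger interval.
    next_trigger_ms = -(-arrival_ms // trigger_interval_ms) * trigger_interval_ms
    return next_trigger_ms - arrival_ms


# Hypothetical arrival times (ms) and an assumed 100 ms trigger interval.
arrivals_ms = [5, 40, 99, 100, 101]
waits = [micro_batch_wait_ms(t, 100) for t in arrivals_ms]
print(waits)  # [95, 60, 1, 0, 99]
```

In this model a record can wait almost a full trigger interval before anything looks at it, whereas an engine that handles each record on arrival has no such scheduling wait. That is the gap a micro-batch-free execution model is meant to close.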