What is the meaning of "Stages" in Spark UI for streaming scripts
I am working on Spark Streaming and am trying to monitor and improve performance for streaming applications, but I am confused by the following questions:
- What does each stage represent in the Spark UI for a Spark Streaming application?
- Not all transformations seem to map to tasks. How do the transformations correspond to their associated tasks?
Code snippet of the stream:
// Flatten each input record into zero or more items
val transformed = input.flatMap(i => processInput(i))
// Reduce by key over a sliding window
val aggregated = transformed.reduceByKeyAndWindow(reduce(_, _), Seconds(aggregateWindowSizeInSeconds), Seconds(slidingIntervalInSeconds))
// Post-process each aggregated value
val finalized = aggregated.mapValues(finalize(_))
finalized
(Only the flatMap stages show up in the UI.)
[Screenshot: Spark Streaming UI]
Thanks,
Tao
1 answer
Spark takes the individual operations from your code and optimizes them into a plan of tasks that will run on the cluster. One example of such an optimization is map fusion: two map calls appear in your code, but only one map task is issued. A stage is a higher-level boundary between groups of tasks, drawn so that crossing it requires a shuffle.
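To make that concrete, here is a minimal standalone batch sketch (hypothetical code, not your streaming job) in which two narrow map calls fuse into a single stage, and the shuffle introduced by reduceByKey starts a second one:

import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("StageDemo")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    // Two narrow transformations: fused into the tasks of a single stage
    val pairs = words.map(w => (w, 1)).map { case (w, n) => (w.toUpperCase, n) }
    // reduceByKey shuffles data by key, so it ends the first stage;
    // everything downstream of it runs as a second stage
    val counts = pairs.reduceByKey(_ + _)
    counts.collect() // the action triggers the job: two stages in the UI

    sc.stop()
  }
}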
So:
- Each of the operators you call on your RDDs results in transformations and actions.
- These build up a DAG of operators.
- The DAG is compiled into stages.
- Each stage is executed as a series of tasks (one way to see these boundaries is sketched below).
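For example, you can print the lineage of each batch produced by your windowed stream with the standard RDD method toDebugString; each extra indentation level in its output marks a shuffle dependency, i.e. a stage boundary:

// Print the lineage of every batch RDD from the stream above;
// the indentation in the output changes at each shuffle boundary
finalized.foreachRDD { rdd =>
  println(rdd.toDebugString)
}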