How does the DAG (directed acyclic graph) work in Hadoop or Spark?

I keep coming across the term DAG in different contexts in the Hadoop ecosystem, such as:

when any action is called on the RDD, Spark creates a DAG and sends it to the DAG scheduler

or

The DAG model is a strict generalization of the MapReduce model

How is this implemented in Hadoop or Spark?



2 answers


The very first DAG that you (as a Spark developer) will "run" arises when you apply transformations to your dataset as an RDD.

After creating an RDD (by loading a dataset from external storage or building it from a local collection), you start with a single node in the RDD lineage.

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[1] at parallelize at <console>:24 []

Immediately after a transformation, for example map, you create another RDD with the original one as its parent.

val even = nums.map(_ * 2)
scala> even.toDebugString
res1: String =
(8) MapPartitionsRDD[2] at map at <console>:26 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:24 []

      



And so on. By applying transformation operators to an RDD, you build up the transformation graph, the RDD lineage, which is simply a directed acyclic graph of RDD dependencies.
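
As a sketch, continuing the session above (the exact RDD ids and console line numbers may differ in your shell), a longer chain simply makes the lineage deeper:

val firstFive = even.filter(_ < 10)
firstFive.toDebugString
// (8) MapPartitionsRDD[3] at filter at <console>:28 []
//  |  MapPartitionsRDD[2] at map at <console>:26 []
//  |  ParallelCollectionRDD[1] at parallelize at <console>:24 []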

Another DAG you can talk about comes into play when you execute an action on an RDD, which triggers a Spark job. That job is eventually mapped onto multiple stages (by the DAGScheduler), and the stages again form a graph: a directed acyclic graph of stages.
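
For example, continuing the same session, a single action call is enough to trigger this machinery (here the job consists of just one stage, since no shuffle is involved):

even.count()   // returns 10; submits one job, which the DAGScheduler maps to a single stage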

There are no other DAGs in Spark.

I cannot comment on Hadoop.



[Diagram: four RDDs A, B, C and D, each with two partitions (red boxes), connected by narrow and wide dependencies and split into two stages]

Spark

val lines = sc.textFile("hdfs://<file_path>", 2)

...

Here the lines RDD has 2 partitions. The diagram above shows four RDDs, A, B, C and D, each with 2 partitions (the red boxes). Each RDD is the result of a transformation. The dependencies between RDDs are classified into narrow and wide: a narrow dependency arises when each partition of the parent RDD is used by at most one partition of the child RDD, while data shuffling produces a wide dependency.
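
As a sketch, one way the elided chain could continue (the word-count logic and variable names are illustrative, not taken from the diagram; it builds on the lines RDD above):

val words  = lines.flatMap(_.split(" "))   // narrow dependency: no shuffle
val pairs  = words.map(word => (word, 1))  // narrow dependency: no shuffle
val counts = pairs.reduceByKey(_ + _)      // wide dependency: data is shuffled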



All the transformations with narrow dependencies fall into stage 1, while the wide dependency starts stage 2.

Such stages form a directed acyclic graph.

These stages are then passed to the task scheduler.
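
One way to see those stage boundaries (assuming the illustrative word-count chain above; RDD ids and console line numbers will vary) is toDebugString, where the +- marker and the indentation shift indicate a shuffle dependency, i.e. a stage boundary:

counts.toDebugString
// (2) ShuffledRDD[4] at reduceByKey at <console>:28 []
//  +-(2) MapPartitionsRDD[3] at map at <console>:27 []
//     |  MapPartitionsRDD[2] at flatMap at <console>:26 []
//     |  MapPartitionsRDD[1] at textFile at <console>:24 []
//     |  HadoopRDD[0] at textFile at <console>:24 []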

Hope this helps.
