How does the DAG (directed acyclic graph) work in Hadoop or Spark?

I keep coming across the term DAG in different contexts in the Hadoop ecosystem, such as:

when any action is called on the RDD, Spark creates a DAG and sends it to the DAG scheduler

or

The DAG model is a strict generalization of the MapReduce model

How is this implemented in Hadoop or Spark?



2 answers


The very first DAG that you (as a Spark developer) will "run" arises when you apply transformations to your dataset as an RDD.

After creating an RDD (by loading a dataset from external storage or building it from a local collection), you start with a single node in the RDD lineage.

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[1] at parallelize at <console>:24 []

Immediately after a transformation, for example map, you create another RDD with the original one as its parent.

val even = nums.map(_ * 2)
scala> even.toDebugString
res1: String =
(8) MapPartitionsRDD[2] at map at <console>:26 []
 |  ParallelCollectionRDD[1] at parallelize at <console>:24 []

      



And so on. By applying transformation operators to an RDD, you build up the transformation graph, the RDD lineage, which is simply a directed acyclic graph of RDD dependencies.
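
As a sketch, continuing the session above (the exact RDD ids and console line numbers may differ in your shell), a longer chain simply makes the lineage deeper:

val firstFive = even.filter(_ < 10)
firstFive.toDebugString
// (8) MapPartitionsRDD[3] at filter at <console>:28 []
//  |  MapPartitionsRDD[2] at map at <console>:26 []
//  |  ParallelCollectionRDD[1] at parallelize at <console>:24 []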

Another DAG you can talk about comes into play when you execute an action on an RDD, which triggers a Spark job. That job is eventually mapped onto multiple stages (by the DAGScheduler), and the stages again form a graph: a directed acyclic graph of stages.
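
For example, continuing the same session, a single action call is enough to trigger this machinery (here the job consists of just one stage, since no shuffle is involved):

even.count()   // returns 10; submits one job, which the DAGScheduler maps to a single stage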

There are no other DAGs in Spark.

I cannot comment on Hadoop.



[Diagram: four RDDs A, B, C and D, each with two partitions (red boxes), connected by narrow and wide dependencies and split into two stages]

Spark

val lines = sc.textFile("hdfs://<file_path>", 2)

...

Here the lines RDD has 2 partitions. The diagram above shows four RDDs, A, B, C and D, each with 2 partitions (the red boxes). Each RDD is the result of a transformation. The dependencies between RDDs are classified into narrow and wide: a narrow dependency arises when each partition of the parent RDD is used by at most one partition of the child RDD, while data shuffling produces a wide dependency.
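
As a sketch, one way the elided chain could continue (the word-count logic and variable names are illustrative, not taken from the diagram; it builds on the lines RDD above):

val words  = lines.flatMap(_.split(" "))   // narrow dependency: no shuffle
val pairs  = words.map(word => (word, 1))  // narrow dependency: no shuffle
val counts = pairs.reduceByKey(_ + _)      // wide dependency: data is shuffled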



All the transformations with narrow dependencies fall into stage 1, while the wide dependency starts stage 2.

Such stages form a directed acyclic graph.

These stages are then passed to the task scheduler.
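
One way to see those stage boundaries (assuming the illustrative word-count chain above; RDD ids and console line numbers will vary) is toDebugString, where the +- marker and the indentation shift indicate a shuffle dependency, i.e. a stage boundary:

counts.toDebugString
// (2) ShuffledRDD[4] at reduceByKey at <console>:28 []
//  +-(2) MapPartitionsRDD[3] at map at <console>:27 []
//     |  MapPartitionsRDD[2] at flatMap at <console>:26 []
//     |  MapPartitionsRDD[1] at textFile at <console>:24 []
//     |  HadoopRDD[0] at textFile at <console>:24 []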

Hope this helps.
