Apache Spark DAGScheduler Missing Parents

When running my iterative program on Apache Spark, I sometimes get the message:

INFO scheduler.DAGScheduler: Missing parents for Stage 4443: List(Stage 4441, Stage 4442)


I understand this to mean that it needs to compute the parent RDDs, but I'm not 100% sure. The problem is that I don't get just one of these messages; I end up with hundreds, if not thousands, of them at a time. This slows my program to a crawl: the iteration doesn't complete within 10-15 minutes, whereas iterations usually take 4-10 seconds.

I cache the principal RDD at each iteration using StorageLevel.MEMORY_AND_DISK_SER, and the next iteration uses this RDD. The RDD's lineage therefore becomes very long, which is why caching is required. But if I cache (and spill to disk), how can a parent be lost?
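
Roughly, the loop looks like this (a minimal sketch, not my actual program; the step transformation, the data, and all names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object IterativeCaching {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-caching").setMaster("local[*]"))

    // Placeholder for the real per-iteration transformation; only the shape matters here.
    def step(rdd: RDD[(Int, Double)]): RDD[(Int, Double)] =
      rdd.mapValues(_ * 0.5).reduceByKey(_ + _)

    var current: RDD[(Int, Double)] = sc.parallelize(1 to 1000).map(i => (i % 10) -> i.toDouble)

    for (_ <- 1 to 20) {
      val next = step(current).persist(StorageLevel.MEMORY_AND_DISK_SER)
      next.count()          // force evaluation so this iteration's result is actually cached
      current.unpersist()   // release the previous iteration's blocks
      current = next
    }

    sc.stop()
  }
}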

1 answer


I quote Imran Rashid from Cloudera:

It is okay for stages to be skipped if they are shuffle stages whose output is read multiple times. For example, here is a small sample program I wrote earlier to demonstrate this: "d3" does not need to be re-shuffled because it is read with the same partitioner every time. So skipping stages is a good thing:

val partitioner = new org.apache.spark.HashPartitioner(10)
// d3 is shuffled once by partitionBy; its shuffle output is reused by every join below.
val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(partitioner)

(0 until 5).foreach { idx =>
  // otherData is new each iteration and must be re-shuffled, but the stage
  // that shuffles d3 is skipped after the first job.
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }.partitionBy(partitioner)
  println(idx + " ---> " + otherData.join(d3).count())
}


If you run this and look at the UI, you will see that every job except the first one has a stage that is skipped. You will also see this in the log:



15/06/08 10:52:37 INFO DAGScheduler: Parents of final stage: List(Stage 12, Stage 13)

15/06/08 10:52:37 INFO DAGScheduler: Missing parents: List(Stage 13)

Admittedly, this is not very clear, but it is just saying that the DAGScheduler first created stage 12 as a necessary step, and then later changed its mind once it realized that everything needed for stage 12 already existed, so there was nothing to do.
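
As a side note (this part is mine, not from the quoted email): if you want to spot skipped stages programmatically rather than in the UI, one option is a SparkListener that compares the stages announced at job start with the stages actually submitted; anything never submitted was skipped. A rough sketch, with the SkippedStageCounter class and its names made up for illustration:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageSubmitted}
import scala.collection.mutable

// Sketch only: stages a job announced at start but never submitted are the "skipped" ones.
// Listener events arrive on Spark's listener-bus thread, so read skippedStageIds
// only after your jobs have finished.
class SkippedStageCounter extends SparkListener {
  private val announced = mutable.Set[Int]()
  private val submitted = mutable.Set[Int]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    announced ++= jobStart.stageIds

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
    submitted += stageSubmitted.stageInfo.stageId

  def skippedStageIds: Set[Int] = (announced -- submitted).toSet
}

// Usage:
//   val counter = new SkippedStageCounter
//   sc.addSparkListener(counter)
//   ... run your jobs ...
//   println("Skipped stages so far: " + counter.skippedStageIds)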

For the original mailing list thread, see: http://apache-spark-developers-list.1001551.n3.nabble.com/
