If one partition is lost, we can use lineage to recover it. Will the upstream RDD be loaded again?

I read the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". The authors say that if one partition is lost, its lineage can be used to recompute it. However, suppose the upstream RDD no longer exists in memory. Will the upstream RDD be loaded again in order to recover the lost RDD partition?



1 answer

Yes, as you said, if the RDD that was used to create the partition is no longer in memory, it has to be loaded from disk again and recomputed. If the parent RDD that was used to create your current partition also no longer exists (neither in memory nor on disk), then Spark has to go one more step back in the lineage and recompute that RDD first. In the worst case, Spark has to go all the way back to the original data source.
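The recovery walk described above can be sketched in a few lines of plain Python (this is a conceptual model, not Spark code; the class and field names are made up for illustration):

```python
# Minimal sketch of lineage-based recovery: each RDD remembers its
# parent and the transformation that produced it. If a partition is not
# cached, recovery walks back up the lineage until it reaches
# materialized data -- in the worst case, the original source.

class SketchRDD:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent    # upstream RDD in the lineage chain
        self.fn = fn            # transformation applied to the parent
        self.source = source    # base data (only for the root RDD)
        self.cached = None      # in-memory copy; may have been evicted

    def compute(self):
        if self.cached is not None:      # partition still in memory
            return self.cached
        if self.source is not None:      # root RDD: reload the base data
            return list(self.source)
        # parent data is gone too: recurse further up the lineage
        return [self.fn(x) for x in self.parent.compute()]

base = SketchRDD(source=[1, 2, 3])
doubled = SketchRDD(parent=base, fn=lambda x: x * 2)
final = SketchRDD(parent=doubled, fn=lambda x: x + 1)

# Nothing is cached, so computing `final` recomputes the whole chain.
print(final.compute())  # [3, 5, 7]
```

The key point the model captures is that recomputation is driven entirely by the recorded lineage, so the cost grows with the length of the un-materialized chain.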

If you have long lineage chains like the worst case described above, that can mean long periods of recomputation. In that situation you should consider using checkpointing, which stores intermediate results in reliable storage (such as HDFS). That way Spark does not have to go all the way back to the original data source; it can use the checkpoint data instead.
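The effect of checkpointing on recovery can be sketched the same way (again a hypothetical model, not the Spark API): a checkpointed RDD reloads its data from reliable storage, so recovery stops there instead of walking back to the source.

```python
# Sketch of how a checkpoint truncates the lineage: once an RDD's data
# is written to reliable storage, recovery reads it back from there and
# never visits the RDDs further upstream.

reliable_store = {}  # stands in for a reliable store like HDFS

class CheckpointedRDD:
    def __init__(self, name, parent=None, fn=None, source=None):
        self.name, self.parent = name, parent
        self.fn, self.source = fn, source
        self.checkpointed = False

    def checkpoint(self):
        # persist this RDD's data to the reliable store
        reliable_store[self.name] = self.compute()
        self.checkpointed = True

    def compute(self):
        if self.checkpointed:            # read back from reliable storage
            return reliable_store[self.name]
        if self.source is not None:      # root RDD: reload the base data
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

base = CheckpointedRDD("base", source=[1, 2, 3])
mid = CheckpointedRDD("mid", parent=base, fn=lambda x: x * 10)
mid.checkpoint()                         # persist the intermediate result
top = CheckpointedRDD("top", parent=mid, fn=lambda x: x + 1)

# Even if `base` were lost entirely, `top` only needs `mid`'s checkpoint.
print(top.compute())  # [11, 21, 31]
```

In real Spark the corresponding calls are `SparkContext.setCheckpointDir(...)` to choose the reliable directory and `RDD.checkpoint()` on the RDD you want to persist.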

@Comment: I'm having trouble finding official reference material, but from what I remember from reading the codebase, Spark only recomputes the partitions that were actually lost, not the whole RDD.
