Apache Spark: Unpersist RDD after the Next Action?

In Spark programming, when I call persist() / cache() on an RDD, I find that the length of time it is kept around for reuse is often not optimal:

Namely, the cached data tends to stick around for several hours, after which the RDD storage is evicted from executor memory and disk. Sometimes this leads to performance / GC problems: the RDD storage stays in memory long after the RDD object itself on the driver has been garbage collected (up to a few hours later), which is wasteful even for a job that relies on cache / checkpoint. Sometimes the opposite happens: the RDD storage is evicted even though the RDD object is still referenced by the driver JVM and could be reused later.
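To make the situation concrete, here is a minimal sketch of the kind of job I mean (the input path, transformation, and app name are placeholders, not my actual pipeline):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("cache-lifetime").setMaster("local[*]"))

// Hypothetical input and transformation, just to illustrate the setup.
val parsed = sc.textFile("hdfs:///data/events")
  .map(_.split(","))
  .persist(StorageLevel.MEMORY_AND_DISK)

// Two actions that reuse the cached blocks.
val total  = parsed.count()
val sample = parsed.take(10)

// After this point the cached blocks stay on the executors until they are
// evicted or the ContextCleaner removes them, which can take hours.
```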

I am looking for a way to override this behavior. The unpersist() function is rarely useful here: because of lazy execution, it can only be called after the next action, which may not be known at the time the RDD is created. Is there a pattern to mark an RDD as "unpersist after the next action" (see the sketch below)? That could save a lot of memory and disk space.
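Continuing the sketch above, this is the best I can do today: call unpersist() manually once I know the last action that needs the cached data has run. The blocking parameter is part of the real RDD.unpersist signature; the transformation and the method named in the final comment are illustrative only.

```scala
// Keyed, expensive-to-recompute transformation (illustrative).
val joined = parsed
  .map(fields => (fields(0), fields))
  .cache()

val result = joined.reduceByKey((a, b) => a).count()  // the "next action"
joined.unpersist(blocking = false)                    // only valid to call here

// What I would like instead is something declarative at creation time, e.g. a
// hypothetical joined.unpersistAfterNextAction(), which as far as I can tell
// does not exist in the RDD API.
```

The problem with the manual pattern is that the code creating the RDD and the code running the final action on it are often far apart, so the creator cannot know where to put the unpersist() call.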
