"take" action immediately after caching RDD only causes 2% caching

I have an RDD that is formed by reading a local text file of approximately 117 MB:

scala> rdd
res87: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:24

I cache the RDD:

scala> rdd.persist()
res84: rdd.type = MapPartitionsRDD[3] at textFile at <console>:24

After that, I call the action take(1) on the RDD to force evaluation. Once that is done, I check the Spark UI Storage page. It shows that the RDD is 2% cached, with 6.5 MB in memory. Then I call the count action on the same RDD. When I check the Storage page again, the numbers have changed: the fraction cached is now 82% and the in-memory size is 258.2 MB. Does this mean that even after persisting the RDD, Spark only caches what is actually needed by the subsequent action (since take(1) reads just the first element)? And when the second action, count, has to touch every element, it finishes caching the rest as well? I haven't come across any documented behavior like this. Is this a bug?



2 answers


Based on the source code, you are correct. When persist() is called, the RDD reference is only stored in a map of persistent RDDs and registered with the context cleaner; no data is read at that point. The actual caching happens when the data is computed for the first time. It can also be partial or evicted later (for example, when there is not enough memory and no active reference to the data exists).
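The mechanism described above can be illustrated with a small self-contained sketch. Note this is a hypothetical model, not Spark's actual classes: persist() only sets a flag, and a partition is cached as a side effect of being computed for the first time.

```scala
// Hypothetical sketch (NOT Spark's real implementation) of lazy per-partition caching.
// Each "partition" is a thunk that produces its data when forced.
class LazyCachedDataset[T](partitions: Seq[() => Seq[T]]) {
  private var persistRequested = false
  private val cache = scala.collection.mutable.Map.empty[Int, Seq[T]]

  // persist() reads no data; it only requests caching for later computations.
  def persist(): this.type = { persistRequested = true; this }

  private def materialize(i: Int): Seq[T] =
    cache.getOrElse(i, {
      val data = partitions(i)()            // compute the partition on demand
      if (persistRequested) cache(i) = data // caching happens as data is read
      data
    })

  // take(n) forces partitions one at a time, stopping once n elements are seen.
  def take(n: Int): Seq[T] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[T]
    var i = 0
    while (out.size < n && i < partitions.size) {
      out ++= materialize(i)
      i += 1
    }
    out.take(n).toSeq
  }

  // count() must touch every partition, so it finishes caching the rest.
  def count(): Long = partitions.indices.map(i => materialize(i).size.toLong).sum

  def cachedPartitions: Int = cache.size
}
```

With four partitions, persist() followed by take(1) leaves only one partition cached; a subsequent count() caches all four, mirroring the 2% → 82% jump seen in the Storage page.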





Spark only materializes RDDs on demand, i.e. in response to an action, as the previous answer mentions. Most actions, such as count(), require reading all partitions of the RDD, but other actions do not need to materialize every partition, and for performance reasons they don't. take(x), and first() (which is essentially take(1)), are examples of such actions. Imagine an RDD with millions of records spread over many partitions, where you only want to inspect a few records via take(x): materializing the entire RDD would be wasteful. Instead, Spark materializes a single partition and looks at the number of elements it contains. Based on that count, it materializes more partitions as needed to satisfy take(x) (I am simplifying the take(x) logic here).



In your case, take(1) requires only one partition, so only that partition is materialized and cached. Then, when you run count(), all partitions must be materialized and cached, up to the limit of the available memory.
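The "materialize one partition, then scan more if needed" behavior can be sketched in plain Scala. This is a simplified model of the idea behind RDD.take, not Spark's actual code; the 4x scale-up factor is an illustrative assumption.

```scala
// Hedged sketch of incremental take(n): scan one partition first, and if that
// didn't yield n elements, scan a growing batch of partitions on each pass.
// Returns the first n elements plus how many partitions were actually read.
def takeIncremental[T](parts: Seq[Seq[T]], n: Int): (Seq[T], Int) = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var scanned = 0
  var batch = 1                          // first pass: a single partition
  while (buf.size < n && scanned < parts.size) {
    val end = math.min(scanned + batch, parts.size)
    (scanned until end).foreach(i => buf ++= parts(i))
    scanned = end
    batch *= 4                           // not enough yet? scan more next round
  }
  (buf.take(n).toSeq, scanned)
}
```

With ten single-element partitions, takeIncremental(parts, 1) reads exactly one partition, while takeIncremental(parts, 3) reads one partition, finds too few elements, and then scans a larger batch, ending up having read five partitions for three elements. This is why the fraction cached after a take can look arbitrary: it reflects how many partitions the scan happened to touch, not the size of the result.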







