Where is the cached RDD stored (i.e. in a distributed fashion or on a single node)?
cache is a lazy operator (it is more of a hint than a transformation or an action in the RDD sense).
Quoting the cache scaladoc:
cache(): RDD.this.type Persist this RDD with the default storage level (MEMORY_ONLY).
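As a minimal sketch (the variable names below are mine, not from the scaladoc), nothing is materialized when cache is called; the blocks are only computed and stored by the first action:

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 100) // sc is the spark-shell SparkContext
val cached = nums.cache()           // lazy: no data is cached yet
// cache() is shorthand for persist with the default storage level:
// nums.persist(StorageLevel.MEMORY_ONLY)
cached.count()                      // the first action computes and caches the blocks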
When a Spark application is submitted for execution (using spark-submit), it requests Spark executors. Quoting the official Components documentation:
Once connected, Spark acquires executors on the cluster nodes, which are processes that run computations and store data for your application.
Each executor starts its own BlockManager (which registers with the BlockManagerMaster that runs on the driver). Quoting the Mastering Apache Spark gitbook:
BlockManager is a key-value store for blocks of data (simply blocks) in Spark. BlockManager acts as a local cache that runs on every "node" in a Spark application, i.e. the driver and executors.
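From spark-shell you can list those BlockManagers yourself; a small sketch using SparkContext.getExecutorMemoryStatus, which returns one entry per BlockManager (the driver plus every executor):

// Each entry is keyed by "host:port" and carries the maximum and
// remaining memory available for caching blocks on that BlockManager.
sc.getExecutorMemoryStatus.foreach { case (blockManager, (max, remaining)) =>
  println(s"$blockManager: max=$max bytes, remaining=$remaining bytes")
}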
The moment an action is executed, it triggers loading the dataset from external data sources, and the data starts flowing through the Spark distributed computation pipeline.
That is when the data gets cached in every BlockManager that takes part in the computation. The number of cached blocks equals the number of partitions of the RDD that was cached, which you can check using the so-called RDD lineage:
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
You can see the RDD lineage of a Spark computation using RDD.toDebugString:
toDebugString: String A description of this RDD and its recursive dependencies for debugging.
For example:
scala> val rdd = sc.parallelize(0 to 9).groupBy(_ % 3).flatMap { case (_, ns) => ns }

scala> rdd.toDebugString
res4: String =
(8) MapPartitionsRDD[7] at flatMap at <console>:24 []
 |  ShuffledRDD[6] at groupBy at <console>:24 []
 +-(8) MapPartitionsRDD[5] at groupBy at <console>:24 []
    |  ParallelCollectionRDD[4] at parallelize at <console>:24 []
The (8) above is the number of partitions, and it is linked to the local[*] master URL I used to start spark-shell; the * maps to the number of CPU cores on my machine.
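You can confirm that count directly (a sketch; the numbers depend on your machine):

sc.defaultParallelism // with local[*] this equals the number of CPU cores (8 here)
rdd.getNumPartitions  // the number of partitions of the RDD above (also 8)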
As a matter of fact, the number of BlockManagers in use is the number of tasks of a stage (in a Spark job) and can be at most the number of Spark executors; it depends on the stage.
Wrapping up...
When we cache an RDD in Spark, is it then stored distributed or on a single node?
In a distributed fashion, but it may end up on a single node if the number of partitions in the stage is 1.
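A sketch of that corner case (the scenario is illustrative): collapse the RDD to a single partition before caching, and the one cached block lives in a single BlockManager, which you can verify with SparkContext.getRDDStorageInfo:

val single = rdd.repartition(1).cache()
single.count() // the action triggers caching of the one and only partition

sc.getRDDStorageInfo
  .filter(_.id == single.id)
  .foreach { info =>
    println(s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
            s"${info.memSize} bytes in memory")
  }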
What system memory is it stored in?
In the "system memory" of the Spark executor, which hosts the BlockManager, which is responsible for RDD blocks.
An RDD is a description of a distributed computation that "goes away" once the DAGScheduler (which runs on the driver) maps it to TaskSets, one per stage, with as many tasks as there are partitions.
The RDD and its partitions thus "disappear" once the action has completed, having been turned into stages and tasks.