Persisting a DataFrame ignores StorageLevel
I am working with Spark SQL DataFrames and I want to speed up subsequent computations by caching intermediate results. Specifically, I call persist(StorageLevel.MEMORY_AND_DISK),
and when I then check the Spark UI Storage tab I can see the RDDs are cached, but the storage level always shows Memory Deserialized 1x Replicated
and the Size on Disk column shows 0.0 B for all RDDs.
I have also tried MEMORY_AND_DISK_SER
but get the same result. Has anyone seen this, or am I doing something wrong? The Spark docs say that calling cache()
or persist()
on a DataFrame defaults to the storage level MEMORY_AND_DISK,
and the cacheTable method
on the SQLContext is documented as "Caches the specified table in-memory."
For context, the general skeleton of my program is:
// Here computeHeavyMethod is some code that returns a DataFrame
val tableData = computeHeavyMethod().persist(StorageLevel.MEMORY_AND_DISK)
tableData.write.mode(SaveMode.Overwrite).json(outputLocation)
tableData.createOrReplaceTempView(tableName)
spark.sql("Some sql statement that uses the table created above")
The documentation says:
MEMORY_AND_DISK
Store RDDs as deserialized Java objects in the JVM. If the RDDs don't fit into memory, store partitions that don't fit on disk and read them from there when needed.
So disk storage is only used once there is not enough memory to hold the RDD; partitions that do fit in memory stay there, which is why the Size on Disk column stays at 0.0 B.
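That placement rule can be sketched as a toy model in plain Scala. This is purely illustrative (the object, sizes, and budget are made up for the example, not Spark's block-manager internals): partitions are kept in memory until a memory budget is exhausted, and only the remainder spill to disk.

```scala
// Toy model of MEMORY_AND_DISK placement: keep partitions in memory while
// they fit within the budget; spill the rest to disk. Illustrative only --
// not Spark's actual block manager logic.
object SpillModel {
  case class Placement(inMemory: List[Long], onDisk: List[Long])

  def place(partitionSizes: List[Long], memoryBudget: Long): Placement = {
    val (mem, disk, _) =
      partitionSizes.foldLeft((List.empty[Long], List.empty[Long], 0L)) {
        case ((m, d, used), size) =>
          if (used + size <= memoryBudget) (size :: m, d, used + size)
          else (m, size :: d, used)
      }
    Placement(mem.reverse, disk.reverse)
  }
}
```

With a budget larger than the total data size, onDisk comes back empty, which matches the observed DiskSize of 0.0 B when everything fits in memory.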
As of Spark 2.2.0 at least, it looks like the "disk" level is only displayed once the RDD has been completely spilled to disk:
StorageLevel: StorageLevel(disk, 1 replicas); CachedPartitions: 36;
TotalPartitions: 36; MemorySize: 0.0 B; DiskSize: 3.3 GB
For a partially spilled RDD, the StorageLevel is displayed as "memory":
StorageLevel: StorageLevel(memory, deserialized, 1 replicas);
CachedPartitions: 36; TotalPartitions: 36; MemorySize: 3.4 GB; DiskSize: 1158.0 MB
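The labelling rule implied by those two log lines can be captured in a few lines of plain Scala (a hypothetical helper reconstructing the observed behaviour, not actual Spark UI code): the level reads "disk" only when nothing at all is held in memory, and any partially spilled RDD still reports "memory".

```scala
// Toy reconstruction of the Storage tab label as observed above:
// "disk" only when the RDD holds zero bytes in memory, otherwise "memory".
// Illustrative only -- not taken from Spark's source.
object StorageLabel {
  def label(memorySize: Long, diskSize: Long): String =
    if (memorySize == 0L && diskSize > 0L) "disk" else "memory"
}
```

Applied to the figures above: MemorySize 0.0 B with DiskSize 3.3 GB yields "disk", while MemorySize 3.4 GB with DiskSize 1158.0 MB still yields "memory".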