Persist dataframe ignores StorageLevel

I am working with Spark SQL DataFrames and I am having trouble keeping subsequent computations fast. Specifically, when I invoke persist(StorageLevel.MEMORY_AND_DISK)

and then check the Spark UI Storage tab, I can see that the RDDs are cached, but the Storage Level column always shows Memory Deserialized 1x Replicated

and the Size on Disk column shows 0.0 B for all RDDs.

I have also tried MEMORY_AND_DISK_SER

but get the same result. Has anyone seen this, or am I doing something wrong here? The Spark docs state that when cache()

or persist()

is invoked on a DataFrame, the default storage level is MEMORY_AND_DISK,

and that the cacheTable

method on SQLContext is documented as "Caches the specified table in-memory."
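
A quick way to confirm that documented default is shown below (a minimal sketch, assuming Spark 2.1+ where Dataset.storageLevel is available, with spark being a SparkSession):

import org.apache.spark.storage.StorageLevel

val df = spark.range(10).toDF("id")
df.cache()               // documented default for Datasets: MEMORY_AND_DISK
df.count()               // an action is needed to actually materialize the cache
println(df.storageLevel) // prints something like StorageLevel(disk, memory, deserialized, 1 replicas)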

For reference, the general skeleton of my program is:

import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

// computeHeavyMethod() is some code that returns a DataFrame
val tableData = computeHeavyMethod().persist(StorageLevel.MEMORY_AND_DISK)
tableData.write.mode(SaveMode.Overwrite).json(outputLocation)
tableData.createOrReplaceTempView(tableName)

spark.sql("Some sql statement that uses the table created above")



2 answers


The documentation says:

MEMORY_AND_DISK

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.



So disk storage is only used when there is not enough memory to hold all the cached partitions.
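
If the goal is to guarantee that partitions land on disk regardless of free memory, a minimal sketch (standard Spark API, reusing the question's computeHeavyMethod placeholder) is to request DISK_ONLY explicitly:

import org.apache.spark.storage.StorageLevel

val df = computeHeavyMethod()       // placeholder from the question
df.persist(StorageLevel.DISK_ONLY)  // every cached partition is written to disk
df.count()                          // materialize the cache
println(df.storageLevel)            // e.g. StorageLevel(disk, 1 replicas)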



As of Spark 2.2.0 at least, it looks like "disk" only shows up in the reported StorageLevel when the RDD is completely spilled to disk:

StorageLevel: StorageLevel(disk, 1 replicas); CachedPartitions: 36; 
TotalPartitions: 36; MemorySize: 0.0 B; DiskSize: 3.3 GB




For a partially spilled RDD, the StorageLevel is displayed as "memory":

StorageLevel: StorageLevel(memory, deserialized, 1 replicas); 
CachedPartitions: 36; TotalPartitions: 36; MemorySize: 3.4 GB; DiskSize: 1158.0 MB
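
The same numbers can be read programmatically; a rough sketch using SparkContext.getRDDStorageInfo (a developer API, field names as of Spark 2.x):

// Print one line per cached RDD, mirroring the Storage tab columns
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: level=${info.storageLevel}, " +
    s"cached=${info.numCachedPartitions}/${info.numPartitions}, " +
    s"memory=${info.memSize} B, disk=${info.diskSize} B")
}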


