Spark spark.shuffle.memoryFraction has no effect
I am testing Spark on Amazon EMR using Python and the basic wordcount example that comes with Spark.
After running the application, I saw in Stage 0 (reduceByKey(add)) that roughly 2.5 GB was shuffle-spilled to memory and 4 GB was spilled to disk. Since the wordcount example does not cache or persist any data, I figured I could improve performance by giving the shuffle more memory. So I added the following to spark-defaults.conf:
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.6
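For context, my understanding of the Spark 1.x (pre-1.6) memory model is that the shuffle pool is carved out of the executor heap as memoryFraction × safetyFraction, and each concurrently running task is granted between 1/(2N) and 1/N of that pool. The 0.8 safety factor and the per-task grant rule are assumptions on my part, so treat this as a rough sketch:

# Back-of-the-envelope for the pre-1.6 shuffle memory pool; the safety
# fraction and the per-task grant rule are assumptions on my part.
executor_memory_mb = 9404
shuffle_memory_fraction = 0.6  # spark.shuffle.memoryFraction (my override)
shuffle_safety_fraction = 0.8  # spark.shuffle.safetyFraction (internal default, I believe)
concurrent_tasks = 4           # spark.executor.cores

pool_mb = executor_memory_mb * shuffle_memory_fraction * shuffle_safety_fraction
print("shuffle pool per executor: ~%.0f MB" % pool_mb)               # ~4514 MB
print("per task: ~%.0f to ~%.0f MB" % (pool_mb / (2 * concurrent_tasks),
                                       pool_mb / concurrent_tasks))  # ~564 to ~1128 MB

So with 9404M executors I expected roughly 4.5 GB of shuffle memory per executor, around 1 GB per concurrent task.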
However, I still get the same performance, with the same amount of shuffle data spilled to memory and to disk. I have confirmed that Spark is picking up these settings via the Spark UI's Environment tab, where my changes are visible. Moreover, when I tried setting spark.shuffle.spill to false, I got the behavior I was looking for: all of the shuffle data stayed in memory.
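(For completeness, the effective values can also be read back from inside the job itself; a minimal sketch, assuming nothing beyond a plain PySpark context:)

from pyspark import SparkContext

sc = SparkContext(appName="ConfCheck")
for key in ("spark.storage.memoryFraction", "spark.shuffle.memoryFraction"):
    # get() returns the supplied default when the key has no set value
    print("%s = %s" % (key, sc.getConf().get(key, "<not set>")))
sc.stop()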
So, what am I doing wrong here, and why is the extra shuffle memory not being used?
My environment:
Amazon EMR with Spark 1.3.1, installed using the -x argument
1 Master node: m3.xlarge
3 Core nodes: m3.xlarge
Application: wordcount.py (sketched below the submit command)
Input: 10 .gz files of 90 MB each (~350 MB uncompressed), stored in S3
Submit command:
/home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py s3n://<input location>
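For reference, wordcount.py is essentially the standard example shipped with Spark, reproduced here from memory as a sketch rather than verbatim:

import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = (lines.flatMap(lambda x: x.split(' '))
                   .map(lambda x: (x, 1))
                   .reduceByKey(add))   # the reduceByKey(add) that triggers the Stage 0 shuffle
    for (word, count) in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()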
spark-defaults.conf:
spark.eventLog.enabled false
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO
spark.master yarn
spark.executor.instances 3
spark.executor.cores 4
spark.executor.memory 9404M
spark.default.parallelism 12
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark-logs/
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.6