Why isn't Spark (on Google Dataproc) using all vcores?
I am running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores
available in the cluster, as you can see below.
Based on some other questions like this one, I set the cluster to use the DominantResourceCalculator so that it
considers both vcpus and memory for resource allocation:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
But when I submit my job with custom Spark flags, it looks like YARN doesn't respect these custom options and defaults to using memory as the criterion for computing resources:
gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.sql.broadcastTimeout=900,spark.network.timeout=800\
,yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator\
,spark.dynamicAllocation.enabled=true\
,spark.executor.instances=10\
,spark.executor.cores=14\
,spark.executor.memory=15g\
,spark.driver.memory=50g \
src/my_python_file.py
Can someone figure out what's going on here?
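In case it helps, this is roughly how the vcore usage can be inspected from the master node with the standard YARN CLI (yarn top and yarn node -list are stock YARN commands):

# SSH to the master node, then check YARN's resource accounting:
yarn top                # live per-application view of allocated vcores/memory
yarn node -list -all    # per-NodeManager status, including running containers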
Firstly, I was wrong to add the config yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
under the yarn: prefix (which targets yarn-site.xml)
instead of under capacity-scheduler: (which targets capacity-scheduler.xml, where it belongs)
when creating the cluster.
Secondly, I changed yarn:yarn.scheduler.minimum-allocation-vcores, which was originally set to 1, to 4.
I'm not sure whether one of these changes or both of them resulted in the solution (I'll update soon). My new cluster creation command looks like this:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
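To double check that each property landed in the intended file, the generated configs can be inspected on the master node (a sketch; /etc/hadoop/conf is the usual Hadoop config location on Dataproc):

# On the master node: each grep should show the <name> and <value> lines
grep -A1 "resource-calculator" /etc/hadoop/conf/capacity-scheduler.xml
grep -A1 "minimum-allocation-vcores" /etc/hadoop/conf/yarn-site.xml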
First, since you enabled dynamic allocation, you should also set the properties spark.dynamicAllocation.maxExecutors
and spark.dynamicAllocation.minExecutors
(see https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
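For example, something along these lines (the bounds here are purely illustrative):

gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.dynamicAllocation.enabled=true\
,spark.dynamicAllocation.minExecutors=2\
,spark.dynamicAllocation.maxExecutors=40 \
src/my_python_file.py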
Second, make sure you have enough partitions in your Spark job. Since you are using dynamic allocation, YARN only allocates enough executors to match the number of tasks (partitions). So check in the Spark UI whether your jobs (more specifically: stages) have more tasks than the available vCores.
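If a stage turns out to have too few tasks, one option is to raise the default parallelism at submit time (a sketch; 160 is illustrative, roughly 10 workers x 16 vcores):

gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.default.parallelism=160\
,spark.sql.shuffle.partitions=160 \
src/my_python_file.py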