Why isn't Spark (on Google Dataproc) using all vcores?
I am running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores
available in the cluster, as you can see below.
Based on some other questions like this one, I set the cluster to use the DominantResourceCalculator so that it
considers both vcpus and memory for resource allocation:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
But when I submit my job with custom Spark flags, it looks like YARN doesn't respect these custom options and defaults to using memory as the criterion for computing resources:
gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.sql.broadcastTimeout=900,spark.network.timeout=800\
,yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator\
,spark.dynamicAllocation.enabled=true\
,spark.executor.instances=10\
,spark.executor.cores=14\
,spark.executor.memory=15g\
,spark.driver.memory=50g \
src/my_python_file.py
Can someone figure out what's going on here?
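In case it helps, this is roughly how the vcore usage can be inspected from the master node with the standard YARN CLI (yarn top and yarn node -list are stock YARN commands):

# SSH to the master node, then check YARN's resource accounting:
yarn top                # live per-application view of allocated vcores/memory
yarn node -list -all    # per-NodeManager status, including running containers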
Firstly, I was wrong to add the config yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
under the yarn: prefix (which targets yarn-site.xml)
instead of under capacity-scheduler: (which targets capacity-scheduler.xml, where it belongs)
when creating the cluster.
Secondly, I changed yarn:yarn.scheduler.minimum-allocation-vcores, which was originally set to 1, to 4.
I'm not sure whether one of these changes or both of them resulted in the solution (I'll update soon). My new cluster creation command looks like this:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
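To double check that each property landed in the intended file, the generated configs can be inspected on the master node (a sketch; /etc/hadoop/conf is the usual Hadoop config location on Dataproc):

# On the master node: each grep should show the <name> and <value> lines
grep -A1 "resource-calculator" /etc/hadoop/conf/capacity-scheduler.xml
grep -A1 "minimum-allocation-vcores" /etc/hadoop/conf/yarn-site.xml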
First, since you enabled dynamic allocation, you should also set the properties spark.dynamicAllocation.maxExecutors
and spark.dynamicAllocation.minExecutors
(see https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
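For example, something along these lines (the bounds here are purely illustrative):

gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.dynamicAllocation.enabled=true\
,spark.dynamicAllocation.minExecutors=2\
,spark.dynamicAllocation.maxExecutors=40 \
src/my_python_file.py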
Second, make sure you have enough partitions in your Spark job. Since you are using dynamic allocation, YARN only allocates enough executors to match the number of tasks (partitions). So check in the Spark UI whether your jobs (more specifically: stages) have more tasks than the available vCores.
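If a stage turns out to have too few tasks, one option is to raise the default parallelism at submit time (a sketch; 160 is illustrative, roughly 10 workers x 16 vcores):

gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.default.parallelism=160\
,spark.sql.shuffle.partitions=160 \
src/my_python_file.py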