How do I limit the number of concurrent map tasks for each executor?

The map operation in my Spark application takes an RDD[A] as input and maps each item of RDD[A] to another object of type B with a custom mapping function func(x: A): B. Since func() requires a significant amount of memory when processing each input x, I want to limit the number of concurrent map tasks per executor so that the total memory required by all tasks running on the same executor does not exceed the physical memory available on the node.

I have checked the available Spark configurations but am not sure which one to use. Would using coalesce(numPartitions) to set the number of partitions of RDD[A] accomplish this?
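
A simplified sketch of what I mean (the concrete types, names and the body of func here are only placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder input/output types; the real A and B are application-specific.
case class A(payload: Array[Byte])
case class B(result: Array[Double])

// Memory-hungry mapping function: allocates a large temporary buffer per input item.
def func(x: A): B = {
  val buffer = new Array[Double](x.payload.length)
  B(buffer)
}

val sc = new SparkContext(new SparkConf().setAppName("memory-hungry-map"))
val rddA = sc.parallelize(Seq(A(Array.fill(1024)(0.toByte))))
val rddB = rddA.map(func)   // each executor core runs one of these map tasks at a time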



1 answer


The number of concurrent tasks per executor is determined by the number of available cores, not by the number of tasks, so changing the parallelism level with coalesce or repartition will not help limit the memory used by each task, only the amount of data in each partition that a task has to process (*).
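
To illustrate (reusing the hypothetical rddA and func from the sketch in the question): with 4 cores per executor, at most 4 map tasks run at the same time no matter how many partitions the RDD has; repartitioning only changes how much data each of those tasks processes.

// Neither call changes per-executor concurrency, only the size of each partition.
val fewerLargerPartitions = rddA.coalesce(8)      // more data (and memory) per task
val moreSmallerPartitions = rddA.repartition(64)  // less data per task, same number of concurrent tasks
val rddBLarge = fewerLargerPartitions.map(func)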

As far as I know, there is no way to limit the memory used by a single task, since it runs in the worker JVM and therefore shares its memory with the other tasks of the same executor.

Assuming a fair share per task, a guideline for the amount of memory available to each task (core) would be:



spark.executor.memory * spark.storage.memoryFraction / #cores-per-executor
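
For example, with hypothetical settings of spark.executor.memory = 8g, spark.storage.memoryFraction = 0.6 and 4 cores per executor, that rule of thumb gives:

8g * 0.6 / 4 = 1.2g per task (core)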

      

A possible way to force fewer concurrent tasks per executor, and thereby increase the amount of memory available to each task, would be to assign more cores to each task using spark.task.cpus (default = 1).
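
As a hedged sketch (the memory and core counts are placeholder values), this could be set on the SparkConf before creating the context; with 4 executor cores and spark.task.cpus = 2, at most 4 / 2 = 2 tasks run concurrently on each executor, doubling the memory share of each task:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("limit-concurrent-map-tasks")
  .set("spark.executor.cores", "4")    // cores available to each executor
  .set("spark.task.cpus", "2")         // cores claimed by every task -> at most 2 concurrent tasks
  .set("spark.executor.memory", "8g")  // placeholder executor heap size

The same properties can also be passed with --conf on the spark-submit command line.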

(*) Given that the problem here is at the level of each individual x item of the RDD, the only setting that could affect memory usage is lowering the parallelism level to less than the number of cores of a single executor, but this would lead to severe underutilization of cluster resources, since all workers but one would be idle.
