Multiple Spark sessions versus a shared global session
Question
What is the motivation for creating multiple Spark apps / sessions instead of sharing a global session?
Description
I have a cluster run by the Spark Standalone cluster manager.
Cluster:
- 5 machines (workers)
- 2 cores (executors) each = 10 executors in total
- 16 GB of RAM on each machine
Jobs:
- Database dump: requires all 10 executors, but only 1 GB of RAM per executor.
- Processing the dump results: requires 5 executors with 8-16 GB of RAM each.
- Fast data-retrieval task: 5 executors with 1 GB of RAM each.
- etc. (a rough per-job configuration sketch follows below)
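For concreteness, this is a minimal sketch of what those per-job requirements could look like if each job were submitted as its own application against the Standalone master. The master URL, object names, app names and exact memory values are placeholders I made up for illustration, not part of the original question:

```scala
import org.apache.spark.sql.SparkSession

// Application for job 1: database dump — all 10 cores, only ~1 GB per executor.
object DumpJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("db-dump")
      .master("spark://master:7077")          // placeholder Standalone master URL
      .config("spark.executor.memory", "1g")  // small executors are enough here
      .config("spark.cores.max", "10")        // take every core in the cluster
      .getOrCreate()
    // ... read the database and write the dump ...
    spark.stop()
  }
}

// Application for job 2: processing the dump — 5 cores, much heavier executors.
object ProcessDumpJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("process-dump")
      .master("spark://master:7077")
      .config("spark.executor.memory", "8g")  // this job needs the big executors
      .config("spark.cores.max", "5")
      .getOrCreate()
    // ... process the dump results ...
    spark.stop()
  }
}
```

Each object above would run as a separate application (separate JVM), which is the only way to give the two jobs differently sized executors on the same Standalone cluster.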
What is the best-practice solution here? Why should I ever prefer the 1st solution over the 2nd, or the 2nd over the 1st, when the cluster resources stay the same?
Solutions:
- Run jobs 1, 2 and 3 as separate Spark applications (separate JVMs).
- Use one global Spark application/session that holds all cluster resources (10 executors, 8 GB RAM each), and create fair-scheduler pools for the 1st, 2nd and 3rd jobs (see the sketch after this list).
- Use some hack like this to run jobs with different configurations from the same JVM, but I'm afraid it is not very stable (and, as far as I can tell, not officially supported by the Spark team).
- Spark Job Server, but as far as I understand it is an implementation of the first solution.
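For comparison, here is a minimal sketch of the second solution, assuming a single shared session sized for the heaviest job and a fair-scheduler allocation file with pools I invented for this example (the pool names, file path and master URL are not from the original question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shared-global-session")
  .master("spark://master:7077")                              // placeholder master URL
  .config("spark.executor.memory", "8g")                      // one size for every executor
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
  .getOrCreate()

// Each job runs on its own driver thread and binds that thread to a pool.
// A pool only influences task scheduling (weight / minShare in cores);
// it cannot change executor memory, which is fixed for the whole session.
val dumpThread = new Thread(() => {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "dump")
  // ... job 1 actions ...
})
val processThread = new Thread(() => {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "process")
  // ... job 2 actions ...
})
dumpThread.start(); processThread.start()
dumpThread.join(); processThread.join()
```

Note that in this layout every executor carries 8 GB even for the jobs that only need 1 GB, which is exactly the trade-off the question is asking about.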
Update
It looks like the second option (a global session holding all cluster resources plus fair-scheduler pools) is not possible, because in the pool.xml file you can only configure the number of cores (minShare), but you cannot set the memory per executor.
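For reference, a pool definition in Spark's fair-scheduler allocation file only exposes schedulingMode, weight and minShare (the latter counted in CPU cores); the pool name below is just an example. There is no per-pool memory setting, which is what the update above refers to:

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="dump">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>10</minShare>  <!-- minimum share is expressed in cores, not memory -->
  </pool>
</allocations>
```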