Spark multiple sessions versus shared global session

Question

What is the motivation for creating multiple Spark apps / sessions instead of sharing a global session?

Description

There is a Spark Standalone cluster manager.

Cluster:

  • 5 machines
  • 2 cores (executors) each = 10 executors in total
  • 16 GB RAM per machine

Job:

  • The database dump requires all 10 executors, but only 1 GB of RAM per executor.
  • Processing the dump results requires 5 executors with 8-16 GB of RAM each.
  • A fast data-retrieval task needs 5 executors with 1 GB of RAM each.
  • etc.

What is the best-practice solution? Why should I prefer the 1st solution over the 2nd, or the 2nd over the 1st, when the cluster resources stay the same?

Solutions:

  • Run jobs 1, 2 and 3 as separate Spark applications (JVMs); see the sketch after this list.
  • Use one global Spark application / session that holds all cluster resources (10 executors with 8 GB RAM each) and create fair scheduler pools for the 1st, 2nd and 3rd jobs.
  • Use some hacks like this to run jobs with different configurations from the same JVM, but I'm afraid that is not very stable (and is it even officially supported by the Spark team?).
  • Spark Job Server, but as far as I understand this is an implementation of the first solution.
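
For concreteness, here is a minimal Scala sketch of the first solution, assuming each job gets its own driver program; the object names, memory values and core counts are illustrative, not taken from any existing codebase.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of solution 1: one Spark application (JVM) per job.
// spark.executor.memory, spark.executor.cores and spark.cores.max are read once
// at application startup, which is why per-job executor sizing maps naturally
// to one application per job.
object DumpJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dump-job")
      .config("spark.executor.memory", "1g")   // 10 small executors
      .config("spark.executor.cores", "1")
      .config("spark.cores.max", "10")
      .getOrCreate()
    // ... dump the database here ...
    spark.stop()
  }
}

object ProcessDumpJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("process-dump-job")
      .config("spark.executor.memory", "8g")   // 5 large executors
      .config("spark.executor.cores", "1")
      .config("spark.cores.max", "5")
      .getOrCreate()
    // ... process the dump results here ...
    spark.stop()
  }
}
```

Each object would be submitted as its own application, so the cluster manager allocates a fresh set of executors with the requested size for every job.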

Update

It looks like the second option (a global session holding all resources plus fair scheduler pools) is not possible, because you can only configure the number of cores in the pool.xml file (minShare), but cannot set the memory per executor.
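
To make that concrete, here is a sketch of the second solution, assuming a FAIR-scheduled global session and a hypothetical pools.xml allocation file: pool definitions (schedulingMode, weight, minShare) only shape how concurrent jobs share task slots, while spark.executor.memory is fixed once for the whole application.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of solution 2: one global session owning all executors, FAIR scheduling.
// Pools control scheduling shares between concurrent jobs, not executor memory:
// every job below runs on the same 8 GB executors.
object GlobalSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("global-session")
      .config("spark.executor.memory", "8g")                            // one size for every job
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/pools.xml")  // hypothetical path
      .getOrCreate()

    // Each logical job runs on its own thread and tags itself with a pool
    // defined in the allocation file; only the scheduling shares differ.
    val dump = new Thread(() => {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "dump")
      // ... run the dump job ...
    })
    val process = new Thread(() => {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "processing")
      // ... process the dump results ...
    })
    dump.start(); process.start()
    dump.join(); process.join()
    spark.stop()
  }
}
```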
