Apache Spark performance tuning
I am working on a project in which I have to tune Spark performance. I have found four parameters that look like the most important ones for this. They are (a sketch of how they might be set follows the list):
- spark.memory.fraction
- spark.memory.offHeap.size
- spark.storage.memoryFraction
- spark.shuffle.memoryFraction
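For reference, here is a minimal sketch of how these settings might be applied; the values are placeholders, not recommendations, and spark.memory.offHeap.enabled is included because the off-heap size is ignored without it:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values for illustration only; tune them against your own workload.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  // Fraction of (JVM heap - 300MB) shared by execution and storage (unified memory, Spark 1.6+).
  .config("spark.memory.fraction", "0.6")
  // Off-heap memory is only used when explicitly enabled.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  // Pre-1.6 settings: ignored by the unified memory manager unless legacy mode is enabled.
  .config("spark.storage.memoryFraction", "0.5")
  .config("spark.shuffle.memoryFraction", "0.2")
  .getOrCreate()
```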
Am I going in the right direction? Please also let me know if I have missed any other important parameters.
Thanks in advance.
Honestly, that is pretty broad to answer. The correct way to approach performance optimization is described mainly in the official documentation, in the Tuning Spark section.
Generally speaking, there are many factors involved in optimizing Spark performance:
- Data serialization
- Memory tuning
- Level of parallelism
- Memory usage of reduce tasks
- Broadcasting large variables
- Data locality
It mainly centers on data serialization, memory tuning, and the trade-off between exact and approximate methods to get the job done quickly.
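As a rough illustration of a few of these areas (data serialization, level of parallelism, and broadcasting large variables), here is a hedged sketch; the values are examples, not tuned recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuning-areas-sketch")
  // Data serialization: Kryo is usually faster and more compact than Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Level of parallelism: default number of partitions for RDD shuffle operations.
  .config("spark.default.parallelism", "200")
  .getOrCreate()

val sc = spark.sparkContext

// Broadcasting large variables: ship a read-only lookup table to the executors once
// instead of re-serializing it inside every task closure.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val total = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))
  .reduce(_ + _)
```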
EDIT:
Courtesy of @zero323:
It is worth pointing out that the last two parameters mentioned in the question, spark.storage.memoryFraction and spark.shuffle.memoryFraction, are deprecated and only used in legacy mode.
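For context, a sketch of what that legacy mode looks like; to my understanding spark.memory.useLegacyMode only exists up to Spark 2.x and was removed in 3.0:

```scala
import org.apache.spark.SparkConf

// The old fractions are only read at all when legacy mode is switched on (Spark 1.6-2.x).
val legacyConf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")   // opt back into the pre-1.6 memory manager
  .set("spark.storage.memoryFraction", "0.6")  // honoured only in legacy mode
  .set("spark.shuffle.memoryFraction", "0.2")  // honoured only in legacy mode
```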
We can split the problem into two parts:
- Getting it to run at all
- Optimizing cost and time
For the first part: depending on whether the memory under pressure is Spark-managed memory or user memory, Spark will either spill to disk or throw an OOM. I think the memory tuning part should also include the total executor memory size.
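To make the executor-memory point concrete, a sketch with made-up sizes; spark.executor.memoryOverhead is the extra off-heap allowance requested from the cluster manager (before Spark 2.3 the YARN-specific spark.yarn.executor.memoryOverhead name applies):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative numbers only: the total heap interacts with spark.memory.fraction,
// which splits the usable heap between Spark-managed memory (execution + storage)
// and user memory (everything your own code allocates).
val spark = SparkSession.builder()
  .appName("executor-memory-sketch")
  .config("spark.executor.memory", "8g")          // JVM heap per executor
  .config("spark.executor.memoryOverhead", "1g")  // extra container memory beyond the heap
  .config("spark.memory.fraction", "0.6")         // share of (heap - 300MB) that Spark manages
  .getOrCreate()
```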
For the second part, how to optimize cost, time, compute, etc.: try Sparklens https://github.com/qubole/sparklens . Shameless plug (I am the author). In most cases the real question is not whether the application is slow, but whether it will scale, or whether it is even using the resources it has been given. And for most applications, the answer is: only up to a limit.
The structural properties of a Spark application place important limits on its scalability. The number of tasks in a stage, the dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
One of the best things about Sparklens is that it simulates and tells you how your Spark application will perform with different executor counts and what the expected level of cluster utilization is at each executor count. It helps you make the right trade-off between time and efficiency.
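For reference, Sparklens is normally attached as an extra Spark listener, with its jar supplied at submit time (for example via spark-submit's --packages option). A sketch, assuming the listener class name from the project's README (com.qubole.sparklens.QuboleJobListener); verify both the class and the package coordinates against the current Sparklens documentation:

```scala
import org.apache.spark.sql.SparkSession

// The Sparklens jar must already be on the classpath, e.g. added via --packages at submit time.
val spark = SparkSession.builder()
  .appName("sparklens-sketch")
  // Register the Sparklens listener so it can collect per-stage metrics and run its simulation.
  .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .getOrCreate()
```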