Spark: PySpark: how to monitor Python worker processes

Question
How can we monitor PySpark's Python worker processes in terms of CPU and memory usage?

More Details
According to this document, one Spark worker can contain one or more Python processes.

Suppose we have allocated 40g of memory for each executor, running on a machine with up to 200g of memory available. Then, according to the documented parameter spark.python.worker.memory, we can set the amount of memory available for each Python process.

Quoted from the description for the spark.python.worker.memory parameter:

Amount of memory to use per Python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation exceeds this amount, it will spill the data to disk.

Let's assume we set spark.python.worker.memory to 2g.
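For reference, here is a minimal sketch of how these settings might be applied when building a SparkSession; the 40g / 2g values are simply the figures used in this question, and the app name is made up:

from pyspark.sql import SparkSession

# Example configuration matching the numbers in the question:
# 40g of JVM memory per executor, 2g per Python worker for aggregations.
spark = (
    SparkSession.builder
    .appName("python-worker-monitoring")
    .config("spark.executor.memory", "40g")
    .config("spark.python.worker.memory", "2g")
    .getOrCreate()
)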

The following questions arise for me:

  • How do we know how many PySpark / Python processes each worker / executor has?
  • How can we keep track of how much memory we are consuming per process, and in general, how close we are to the 40g executor limit we have set?
  • How can we keep track of how much we spill to disk in each process?
  • More generally, how can we optimize PySpark apps that use the spark.python.worker.memory parameter? Is it just a matter of trial and error? If so, how can we test / monitor this (similar to the above)?



Why? Well, we are facing some performance issues that are very specific to our application, and we are seeing inconsistent errors that we cannot reproduce. So we need to control and understand the finer details of what happens every time our application runs.

1 answer


according to the documented parameter spark.python.worker.memory, we can set the amount of memory available for each Python process.

This is not true. As explained in the documentation you linked, this parameter is used to control aggregation behavior, not Python worker memory in general.

This memory does not account for the size of local objects or broadcast variables, only for the temporary structures used during aggregations.
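A minimal sketch of where this limit actually applies (my own illustration, reusing the spark session from the earlier sketch): RDD-side aggregations run inside the Python workers, and it is the in-memory buffers built for them that spark.python.worker.memory bounds, spilling to disk when exceeded.

# Assumes `spark` from the configuration sketch above.
rdd = spark.sparkContext.parallelize(
    [(i % 1000, i) for i in range(1_000_000)]
)

# The per-key combine buffers for this aggregation live in the Python worker;
# spark.python.worker.memory caps them and triggers a spill to disk if exceeded.
sums = rdd.reduceByKey(lambda a, b: a + b)
print(sums.take(5))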

How do we know how many PySpark / Python processes each worker / executor has?

Python workers can be spawned up to the limit set by the number of available cores. Since workers can be started or killed during execution, the actual number of workers outside of peak load may be lower.
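One rough way to see this is to inspect processes directly on an executor node. A sketch, assuming the third-party psutil package is installed there; PySpark workers are forked from a parent process launched as python -m pyspark.daemon, so filtering on that module name catches both:

import psutil  # third-party; assumed to be installed on the executor node

# List the Python processes Spark has forked on this node, with their RSS.
for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "pyspark.daemon" in cmdline:
        rss_mb = proc.info["memory_info"].rss / 1024 ** 2
        print(f"pid={proc.info['pid']} rss={rss_mb:.0f}MB cmd={cmdline}")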



How can we keep track of how much memory we are consuming per process, and in general, how close we are to the 40g executor limit we have set?

There is no Spark-specific answer here. You can use general monitoring tools, or the resource module from within the application itself.
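As an illustration of the resource-module approach (my own sketch, not from the answer; it reuses the rdd defined earlier), each Python worker can report its own peak memory from inside a task. Note that ru_maxrss is reported in kilobytes on Linux:

import resource

def log_peak_memory(partition_index, rows):
    # Materialize the partition, then report this Python worker's peak RSS.
    rows = list(rows)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux
    yield (partition_index, len(rows), peak_kb / 1024.0)  # peak in MB

stats = rdd.mapPartitionsWithIndex(log_peak_memory).collect()
for partition, n_rows, peak_mb in stats:
    print(f"partition {partition}: {n_rows} rows, peak RSS ~ {peak_mb:.0f} MB")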

How can we keep track of how much we spill to disk in each process?

You can use the Spark REST API to get some information, but overall PySpark's metrics are somewhat limited.
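For example, the per-stage data exposed by the REST API includes memoryBytesSpilled and diskBytesSpilled. A sketch, assuming the driver UI is reachable at http://localhost:4040 and the requests library is available (adjust the address for your deployment):

import requests  # assumed available; any HTTP client works

UI = "http://localhost:4040"  # driver UI address (hypothetical default)

app_id = requests.get(f"{UI}/api/v1/applications").json()[0]["id"]
for stage in requests.get(f"{UI}/api/v1/applications/{app_id}/stages").json():
    print(
        stage["stageId"],
        stage["status"],
        "memorySpilled:", stage["memoryBytesSpilled"],
        "diskSpilled:", stage["diskBytesSpilled"],
    )

Note that these metrics are reported per stage / task, not per individual Python worker process.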
