Reading CSV with pandas in parallel creates huge memory leaks / zombie processes

I am reading 1000+ CSV files of roughly 200 MB each in parallel, modifying them with pandas, and saving the modified CSVs afterwards. This creates a lot of zombie processes that accumulate up to 128+ GB of RAM, ruining performance.

    from multiprocessing import Pool, cpu_count

    csv_data = []
    c = zip(a, b)  # pairs of arguments for load_and_process_csv
    process_pool = Pool(cpu_count())
    for name_and_index in process_pool.starmap(load_and_process_csv, c):
        csv_data.append(name_and_index)
    process_pool.terminate()
    process_pool.close()
    process_pool.join()

This is my current solution. It doesn't seem to cause a problem until more than 80 or so CSVs have been processed.

PS: Even after the pool has finished, 96 GB of RAM is still taken up, and you can see Python processes holding RAM but neither doing anything nor being killed. Moreover, I know with certainty that the function that the pool starts does finish.

I hope this is descriptive enough.



2 answers


The Python multiprocessing module is process-based, so it is natural that you end up with many processes.

Even worse, these processes do not share memory; they exchange data via pickling/unpickling. That makes them very slow whenever big data has to be moved between processes, which is what is happening here.
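For illustration, one way to keep big data out of the pickling path is to have each worker write its processed DataFrame to disk and return only a small tuple to the parent. This is just a sketch; the body of load_and_process_csv and the output naming are my assumptions, not code from the question.

    import os
    import pandas as pd

    def load_and_process_csv(name, index):
        df = pd.read_csv(name)
        # ... apply your modifications to df here ...
        out_path = os.path.splitext(name)[0] + "_processed.csv"
        df.to_csv(out_path, index=False)
        # Return only small metadata; returning df itself would be pickled
        # and copied back to the parent process.
        return name, index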



In this case, since the processing is I/O-bound, you may get better performance with multithreading via the threading module, if I/O really is the bottleneck. Threads share memory, but because of the GIL they also effectively share one CPU core for Python code, so a speedup is not guaranteed; you will have to try it.
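A minimal sketch of the multithreading variant, assuming the load_and_process_csv function and the a and b sequences from the question (the worker count of 8 is an arbitrary choice):

    from concurrent.futures import ThreadPoolExecutor

    # Threads share memory with the parent, so results are not pickled.
    with ThreadPoolExecutor(max_workers=8) as executor:
        csv_data = list(executor.map(lambda args: load_and_process_csv(*args), zip(a, b)))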

Update: If multithreading doesn't help, you are not left with many options, because this case runs straight into a critical weakness of Python's parallel-processing architecture. You can try dask (parallel pandas): http://dask.readthedocs.io/en/latest/
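As a rough illustration of the dask route (the glob pattern and the "value" column are made-up placeholders, not from the question):

    import dask.dataframe as dd

    # Lazily read all matching CSV files as one partitioned dataframe.
    df = dd.read_csv("data/*.csv")
    # Example transformation on a hypothetical column.
    df = df[df["value"] > 0]
    # Write one output CSV per partition; the * is replaced by the partition number.
    df.to_csv("processed-*.csv")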



Question:

    process_pool = Pool(48)
    for name_and_index in process_pool.starmap(load_and_process_csv, c):

I tried your example code, but I could not get more than one process to run. Your code also looks odd: with Pool(48) there will be many processes sitting idle next to the ones doing work. To run more than one process I had to change it to:

    process_pool = Pool(2)
    c_list = [(a, b), (a, b)]
    csv_data = process_pool.starmap(load_and_process_csv, c_list)

Python "3.6.1 Documentation multiprocessing.pool.Pool.starmap
   starmap (func, iterable [, chunksize])
    Like map (), except that the elements of an iterable are expected to be iterable, which are unpacked as arguments. Hence, iterable [(1,2), (3, 4)] results in [func (1,2), func (3,4)].
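A tiny standalone example of that unpacking behaviour (add and the argument lists here are illustrative, not from the question):

    from multiprocessing import Pool

    def add(x, y):
        return x + y

    if __name__ == "__main__":
        with Pool(2) as pool:
            # zip([1, 3], [2, 4]) yields (1, 2) and (3, 4), which starmap
            # unpacks into add(1, 2) and add(3, 4).
            print(pool.starmap(add, zip([1, 3], [2, 4])))  # [3, 7]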

Since I don't know anything about (a, b), double-check that the following does not apply to you.



Python "3.6.1 Documentation multiprocessing.html # all-start-methods
    Explicitly transfer resources to child processes On Unix using the fork method, a child process can use a shared resource created in the parent process using a global resource. However, it is better to pass an object as an argument constructor for the child process. In addition to being (potentially) compatible with Windows and other startup methods, this also ensures that while the child process is still alive, the object will not be garbage collected in the parent process. This can be important if some the resource is freed when the object is garbage collected in the parent process.
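A small sketch of what the documentation means by passing the resource explicitly (worker and shared_list are illustrative names, not from the question):

    from multiprocessing import Process

    def worker(shared_list):
        # The child receives the object as an argument instead of reaching
        # for a module-level global inherited via fork.
        print(len(shared_list))

    if __name__ == "__main__":
        shared_list = list(range(10))
        p = Process(target=worker, args=(shared_list,))
        p.start()
        p.join()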


Question:

    I know with certainty that the function that the pool starts does finish.

From the documentation for terminate():

    Stops the worker processes immediately without completing outstanding work.

Please explain why you are calling process_pool.terminate()?
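For comparison, the usual clean shutdown is close() followed by join(); terminate() kills the workers immediately. A sketch assuming the question's load_and_process_csv and the zipped (a, b) arguments:

    from multiprocessing import Pool, cpu_count

    process_pool = Pool(cpu_count())
    try:
        csv_data = process_pool.starmap(load_and_process_csv, zip(a, b))
    finally:
        process_pool.close()   # no new tasks will be submitted
        process_pool.join()    # wait for the worker processes to exit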







