Limiting the number of processes running concurrently from a Python script

I am running a backup script that spawns child processes to perform backups using rsync. However, I have no way of limiting the number of rsyncs it runs at the same time.

Here's the code I'm currently working on:

print "active_children: ", multiprocessing.active_children()
print "active_children len: ", len(multiprocessing.active_children())
while len(multiprocessing.active_children()) > 49:
    sleep(2)
p = multiprocessing.Process(target=do_backup, args=(shash["NAME"],ip,shash["buTYPE"], ))
jobs.append(p)
p.start()


This shows a maximum of one active child even when I have hundreds of rsyncs running. Here is the code that actually starts rsync (from the do_backup function), where command is the variable containing the rsync string:

print command
subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
return 1


If I add sleep(x) to the do_backup function, it does appear as an active child while sleeping. Also, the process table shows that the rsync processes have a PPID of 1. I am assuming that rsync splits off and is no longer a descendant of python, which allows my child process to die, so I can no longer count it. Does anyone know how to keep the python child alive and counted until rsync completes?


3 answers


Let me first clear up some misconceptions

I am assuming that rsync splits off and is no longer a descendant of python, which allows my child process to die, so I can no longer count it.

rsync does indeed "split off". On UNIX systems, this is called a fork.

When a process forks, a child process is created - so rsync is a child of the python process. This child executes independently of the parent - and concurrently ("at the same time").

A process can, however, manage its own children. There are specific syscalls for this, but that is a bit off-topic when talking about python, which has its own high-level interfaces.

If you check the subprocess.Popen documentation, you will notice that it is not a function at all: it is a class. Calling it creates an instance of that class - a Popen object. Such objects have several methods. In particular, wait will block your parent process (python) until the child process terminates.


With that in mind, let's take a look at your code, simplified a bit:



p = multiprocessing.Process(target=do_backup, ...)


This is where you fork and create a child process. This child is another python interpreter (as with all multiprocessing processes), and it will execute the function do_backup.

def do_backup():
    subprocess.Popen("rsync ...", ...)


Here you fork again. You create yet another process (rsync) and let it run "in the background", because you do not wait for it.


With all of this cleared up, I hope you can see the way forward with your existing code. If you want to reduce its complexity, though, I recommend that you check out and adapt JoErNanO's answer, which reuses multiprocessing.Pool to automate the bookkeeping of processes.

Whichever way you decide to pursue, you should avoid forking with Popen to create the rsync process, as that creates yet another process unnecessarily. Instead, check out os.execv, which replaces the current process with another one.
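As a minimal sketch of that idea (the rsync path and arguments are assumptions for illustration), the multiprocessing child can replace itself with rsync, so the process being tracked and rsync are one and the same:

import os

def do_backup(name, ip, bu_type):
    # Replace this child python process with rsync itself.
    # By convention, the first element of the argument list is the program name.
    # The path and arguments below are placeholders.
    os.execv("/usr/bin/rsync",
             ["rsync", "-a", "/src/%s/" % name, "%s:/backups/%s/" % (ip, bu_type)])
    # Nothing after os.execv runs: the process image has been replaced.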



Multiprocessing Pool

Have you thought about using multiprocessing.Pool? It lets you define a fixed number of worker processes which are used to carry out the jobs you submit. The key here is the fixed number, which gives you full control over how many rsync instances run at a time.

Following the example given in the linked documentation, you first declare a Pool of n processes, and then you can choose whether to map() or apply() your jobs to the pool (together with their respective _async() siblings).

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes

    pool.apply_async(f, (10,))    # evaluate "f(10)" asynchronously
    ...
    pool.map(f, range(10))




The obvious advantage here is that you will never fork-bomb your machine by accident, since you will only ever spawn the n processes you asked for.

Running rsync

Now your process spawning code will become something like:

from multiprocessing import Pool
import time

def do_backup(arg1, arg2, arg3, ...):
    # Do stuff

if __name__ == '__main__':
    # Start a Pool with 4 processes
    pool = Pool(processes=4)
    jobs = []

    for ... :
        # Run the function
        proc = pool.apply_async(func=do_backup, args=(shash["NAME"],ip,shash["buTYPE"], ))
        jobs.append(proc)

    # Wait for jobs to complete before exiting
    while(not all([p.ready() for p in jobs])):
        time.sleep(5)

    # Safely terminate the pool
    pool.close()
    pool.join()




This is not multithreading, but multiprocessing. I am assuming you are on a Unix system since you are using rsync, although I believe it can run on Windows systems as well. To control the death of your spawned child processes, you have to fork them.

A good question on how to do this in Python is here .
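As a minimal sketch of that approach (the rsync arguments are placeholders), fork, run rsync in the child, and have the parent wait on that specific PID:

import os
import subprocess

pid = os.fork()
if pid == 0:
    # Child: run rsync and exit with its return code.
    code = subprocess.call(["rsync", "-a", "/src/", "/dst/"])  # placeholder arguments
    os._exit(code)
else:
    # Parent: block until that specific child terminates.
    _, status = os.waitpid(pid, 0)
    print "rsync exit code:", os.WEXITSTATUS(status)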
