Python multiple subprocesses with a pool/queue: recover output as soon as one finishes and launch the next job in the queue

I am currently launching a subprocess and parsing its stdout on the fly, without waiting for it to complete.

import subprocess

for sample in all_samples:
    my_tool_subprocess = subprocess.Popen('mytool {}'.format(sample), shell=True, stdout=subprocess.PIPE)
    line = my_tool_subprocess.stdout.readline()
    while line:
        # here I parse stdout..
        line = my_tool_subprocess.stdout.readline()

      

In my script I perform this action multiple times, depending on the number of input samples.

The main problem is that each subprocess is a program/tool that uses one processor at 100% while it is running, and it takes a while, maybe 20-40 minutes, to complete.

What I would like to achieve is to set up a pool or queue (I'm not sure what the exact terminology is here) of at most N subprocesses running at the same time, so I can maximize performance instead of proceeding sequentially.

Thus, the flow of execution, for example for a pool with a maximum of 4 jobs, should be:

  • Launch 4 subprocesses.
  • When one of the jobs finishes, parse its stdout and launch the next one.
  • Do this until all the jobs in the queue have been completed.

If I achieve this, I don't know how I could identify which sample's subprocess is the one that has finished. At the moment I don't need to identify them, since each subprocess runs sequentially and I parse stdout while the subprocess is printing it.

This is really important, as I need to identify the output of each subprocess and assign it to the corresponding input sample.
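
To make the flow concrete, here is a rough sketch of the kind of thing I have in mind (just an illustration, assuming something like concurrent.futures.ThreadPoolExecutor; run_and_parse is a hypothetical helper that runs mytool for one sample and parses its stdout):

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_and_parse(sample):
    # hypothetical helper: run 'mytool <sample>' and parse its stdout, as in the loop above
    ...

with ThreadPoolExecutor(max_workers=4) as pool:            # at most 4 tools running at once
    futures = {pool.submit(run_and_parse, s): s for s in all_samples}
    for future in as_completed(futures):                    # jobs come back as soon as they finish
        sample = futures[future]                            # this tells me which sample it was
        result = future.result()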

+3




3 answers


ThreadPool could be a good fit for your problem: you set the number of worker threads, add the jobs, and the threads will work through all of them.



from multiprocessing.pool import ThreadPool
import subprocess


def work(sample):
    my_tool_subprocess = subprocess.Popen('mytool {}'.format(sample), shell=True, stdout=subprocess.PIPE)
    line = my_tool_subprocess.stdout.readline()
    while line:
        # here I parse stdout..
        line = my_tool_subprocess.stdout.readline()


num = None  # set to the number of workers you want (it defaults to the cpu count of your machine)
tp = ThreadPool(num)
for sample in all_samples:
    tp.apply_async(work, (sample,))

tp.close()
tp.join()
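
If you also need to know which sample each output belongs to (as the question asks), one option is to have work return a (sample, result) tuple and keep the AsyncResult objects returned by apply_async. This is only a sketch of that idea, with the parsing reduced to collecting the raw lines:

from multiprocessing.pool import ThreadPool
import subprocess


def work(sample):
    my_tool_subprocess = subprocess.Popen('mytool {}'.format(sample), shell=True, stdout=subprocess.PIPE)
    parsed = []
    line = my_tool_subprocess.stdout.readline()
    while line:
        parsed.append(line)            # replace with your real parsing
        line = my_tool_subprocess.stdout.readline()
    my_tool_subprocess.wait()
    return sample, parsed


tp = ThreadPool(4)                     # at most 4 jobs at the same time
results = [tp.apply_async(work, (s,)) for s in all_samples]
tp.close()

for res in results:
    sample, parsed = res.get()         # blocks until that particular job is done
    # here you know exactly which sample the parsed output belongs to

tp.join()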

      

+7




As I understood your question, your problem is that the result of the first process, once it finishes, has to be handled before the second process runs, then the third, and so on. For that you should import the threading module and use the Thread class:

import threading

proc = threading.Thread(target=func, args=func_args)  # func_args is a tuple of arguments for func
proc.start()                                           # start the thread
proc.join()                                            # ensures that the next thread does not start until the previous one ends
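
Applied to the samples from the question, that pattern would look roughly like this (a sketch; note that joining each thread right after starting it makes the jobs run one after another, just like the original loop):

import threading

def work(sample):
    # run mytool for this sample and parse its stdout, as in the question
    ...

for sample in all_samples:
    proc = threading.Thread(target=work, args=(sample,))
    proc.start()
    proc.join()        # wait for this sample before starting the next one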

0




Well, if that is the case, you should write the same code as above, but without proc.join(). In this case the main thread (main) will start the other four threads without waiting for them to finish. This is multithreading within a single process, in other words it does not take advantage of a multi-core processor. To use a multi-core processor you should use the multiprocessing module, like this:

import multiprocessing

proc = multiprocessing.Process(target=func, args=func_args)  # func_args is a tuple of arguments for func
proc.start()

      

This way each job will be a separate process, and the individual processes can run completely independently of each other.
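
To combine this with the question's requirement of at most N jobs at a time, a process pool caps the number of worker processes and hands results back as each job finishes. A minimal sketch, with the parsing again reduced to collecting raw lines:

import multiprocessing
import subprocess


def work(sample):
    proc = subprocess.Popen('mytool {}'.format(sample), shell=True, stdout=subprocess.PIPE)
    parsed = []
    line = proc.stdout.readline()
    while line:
        parsed.append(line)                        # replace with your real parsing
        line = proc.stdout.readline()
    proc.wait()
    return sample, parsed


if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:                 # at most 4 worker processes
        for sample, parsed in pool.imap_unordered(work, all_samples):
            # results arrive as soon as each job finishes, tagged with the sample they belong to
            print(sample, len(parsed))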

0








