Multiprocessing pool with for loop

I have a list of files that I pass into a for loop, running a whole bunch of functions on each one. What's the easiest way to parallelize this? I couldn't find an example of this anywhere, and I think my current implementation is wrong because I only ever saw one file get started. From some reading I've done, this should be an embarrassingly parallel case.

The old code looks something like this:

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']
for file in filenames:
    file1 = pd.read_csv(file)
    print('running ' + str(file))
    a = function1(file1)
    b = function2(a)
    c = function3(b)
    for d in range(1, 6):
        e = function4(c, d)
    c.to_csv('output.csv')


My (wrong) parallel code:

import pandas as pd
from multiprocessing import Pool

filenames = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']

def multip(filenames):
    file1 = pd.read_csv(file)
    print('running ' + str(file))
    a = function1(file1)
    b = function2(a)
    c = function3(b)
    for d in range(1, 6):
        e = function4(c, d)
    c.to_csv('output.csv')

if __name__ == '__main__':
    pool = Pool(processes=4)
    runstuff = pool.map(multip(filenames))


What I (think) I want is for each file to be processed on its own core (perhaps in its own process?). I also ran

multiprocessing.cpu_count()


and got 8 (I have a quad-core, so it's probably counting hyperthreads). Since I only have 10 files, if I can put one file per process to speed things up, that would be great! I hope the remaining 2 files get picked up by a worker once the processes from the first round have finished.
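
From what I've read, a Pool should handle that queueing by itself. Here is a minimal sketch I used to convince myself (the work function is just a dummy for illustration):

import multiprocessing
import os

def work(i):
    # report which process handled each task; with 10 tasks and
    # 8 workers, the last 2 tasks reuse workers that freed up
    return (i, os.getpid())

if __name__ == '__main__':
    with multiprocessing.Pool(processes=8) as p:
        for i, pid in p.map(work, range(10)):
            print('task', i, 'ran in process', pid)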

Edit: For greater clarity, the functions (i.e. function1, function2, etc.) themselves call other functions (e.g. function1a, function1b) inside their respective files, and I call function1 via an import statement.

I am getting the following error:

OSError: Expected file path name or file-like object, got <class 'list'> type


Apparently read_csv dislikes being handed the whole list, but I don't want to use filenames[0] inside the if statement because then only one file gets run.
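
Is the problem that I am calling multip(filenames) myself instead of letting map call it for me? In other words (just my guess, with the processing functions elided), should it look more like this:

import pandas as pd
from multiprocessing import Pool

def multip(name):
    # guess: each worker receives ONE filename, not the whole list
    file1 = pd.read_csv(name)
    print('running ' + str(name))
    # ... function1/function2/function3 as above ...

if __name__ == '__main__':
    filenames = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']
    pool = Pool(processes=4)
    # pass the function and the iterable as separate arguments
    pool.map(multip, filenames)
    pool.close()
    pool.join()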



1 answer


import multiprocessing
import time

names = ['file1.csv', 'file2.csv']

def multip(name):
    # do stuff here
    pass

if __name__ == '__main__':
    # use one less process than cores to be a little more stable
    p = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
    # timing it...
    start = time.time()
    for file in names:
        p.apply_async(multip, [file])

    p.close()
    p.join()
    print("Complete")
    end = time.time()
    print('total time (s)= ' + str(end - start))
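
If you also want the return values back (or to see exceptions raised inside the workers, which apply_async otherwise swallows), keep the AsyncResult objects and call .get() on them. A minimal sketch, with a dummy multip standing in for your real one:

import multiprocessing
import time

def multip(name):
    # dummy worker standing in for the real file processing
    return 'processed ' + name

if __name__ == '__main__':
    names = ['file1.csv', 'file2.csv']
    p = multiprocessing.Pool(processes=max(1, multiprocessing.cpu_count() - 1))
    start = time.time()
    # keep the AsyncResult handles so results (and worker exceptions)
    # can be retrieved after the pool finishes
    results = [p.apply_async(multip, [name]) for name in names]
    p.close()
    p.join()
    for r in results:
        print(r.get())
    print('total time (s)= ' + str(time.time() - start))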


EDIT: Replace the if __name__ == '__main__' block with this. This will run all the files:



if __name__ == '__main__':
    p = multiprocessing.Pool(processes=len(names))
    start = time.time()
    async_result = p.map_async(multip, names)
    p.close()
    p.join()
    print("Complete")
    end = time.time()
    print('total time (s)= ' + str(end - start))
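
One more thing worth noting from the question's code: every worker writes to the same 'output.csv', so the results will overwrite each other. A sketch of deriving one output name per input (function1 etc. are your own functions, elided here):

import os
import pandas as pd

def multip(name):
    df = pd.read_csv(name)
    # ... run function1/function2/function3 on df here ...
    # write one output per input so parallel workers don't
    # clobber each other's 'output.csv'
    out = os.path.splitext(name)[0] + '_output.csv'
    df.to_csv(out, index=False)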

