Reading files from disk in Python in parallel

I am currently switching from MATLAB to Python, mainly because of the large number of interesting machine learning packages available in Python. But one issue that has been a source of confusion for me is parallel processing. Specifically, I want to read thousands of text files from disk in a for loop, and I want to do this in parallel. In MATLAB, using parfor instead of for does the trick, but so far I haven't been able to figure out how to do this in Python. Here is an example of what I want to do: read N text files, shape each one into an N1xN2 array, and store them all in an NxN1xN2 numpy array, which is what the function returns. Assuming the filenames are file_0001.dat, file_0002.dat, etc., the code I would like to parallelize looks like this:

import numpy as np
N=10000
N1=200
N2=100
result = np.empty([N, N1, N2])
for counter in range(N):
    t_str="%.4d" % counter        
    filename = 'file_'+t_str+'.dat'
    temp_array = np.loadtxt(filename)
    temp_array.shape=[N1,N2]
    result[counter,:,:]=temp_array


I am running the code on a cluster, so I can use many processors. Therefore, any comment on which parallelization method is most suitable for this task (if there are several) is most welcome.

NOTE: I am aware of this question, but there the only things to worry about were the variables out1, out2, and out3, and they were clearly used as arguments of the function being parallelized. Here, however, I have many 2D arrays that have to be read from files and stored in a 3D array, so the answer to that question is not sufficient for my case (or at least that is how I understood it).

1 answer


You probably still want to use multiprocessing, just structure it a little differently:

from multiprocessing import Pool
import numpy as np

N=10000
N1=200
N2=100
result = np.empty([N, N1, N2])

# generator that produces the filenames one at a time
filenames = ('file_%.4d.dat' % i for i in range(N))

# worker function: load one text file and reshape it to N1 x N2
# (it must be a module-level function so the pool can pickle it)
def myshaper(fname):
    return np.loadtxt(fname).reshape([N1, N2])

pool = Pool()
for i, temp_array in enumerate(pool.imap(myshaper, filenames)):
    result[i, :, :] = temp_array
pool.close()
pool.join()

What this does is first create a generator for the filenames in filenames. This means the filenames are never all held in memory at once, but you can still loop over them. It then defines a small helper function, myshaper, that loads one file and reshapes it (this plays the role of MATLAB's anonymous functions; note that a function used with multiprocessing has to be defined at module level so it can be pickled). It then applies this function to each filename, spreading the work across multiple processes, and stores each result in the result array in the parent process. Finally it closes the pool and waits for the workers to finish.



This version uses somewhat more idiomatic Python. However, an approach that is more similar to your original code (albeit less idiomatic) might help you understand it a little better:

from multiprocessing import Pool
import numpy as np

N=10000
N1=200
N2=100
result = np.empty([N, N1, N2])

def proccounter(counter):
    t_str="%.4d" % counter
    filename = 'file_'+t_str+'.dat'
    temp_array = np.loadtxt(filename)
    temp_array.shape=[N1,N2]
    return counter, temp_array

pool = Pool()
for counter, temp_array in pool.imap(proccounter, range(N)):
    result[counter,:,:] = temp_array
pool.close()
pool.join()

It just moves most of your for loop body into a function, applies that function to each element of the range using multiple processes, and then puts the results into the array. Basically it is your original for loop split in two: the loop over the files now runs inside pool.imap across the worker processes, and the remaining loop just stores each returned 2D array in the right slice of the 3D array.
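
Not part of the original answer, but for comparison: roughly the same thing can be written with the standard library's concurrent.futures interface, which makes it easy to switch between process and thread pools later. This is only a sketch under the same assumptions about the filenames; on platforms that spawn rather than fork worker processes you would also need the usual if __name__ == '__main__' guard:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

N=10000
N1=200
N2=100
result = np.empty([N, N1, N2])

def proccounter(counter):
    # same worker as above: load one file and reshape it to N1 x N2
    filename = 'file_%.4d.dat' % counter
    return counter, np.loadtxt(filename).reshape([N1, N2])

# executor.map yields the results in submission order, one per counter
with ProcessPoolExecutor() as executor:
    for counter, temp_array in executor.map(proccounter, range(N)):
        result[counter,:,:] = temp_array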
