No speedup when reading files with gevent

I need to load about 100k files containing vectors and aggregate their contents into a numpy array. This takes about 3 minutes, so I want to speed it up. I tried gevent, but saw no improvement.

I read that asynchronous calls, not multiprocessing, are the way to speed up I/O, and that gevent is the recommended library for this. I wrote an example for loading images and saw a significant speed improvement there. Here is a simplified version of my code:

import gevent.pool
import numpy

def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]

# poolsize, CHUNK_SIZE and file_size are set elsewhere (omitted here)
file_paths = []  # list of filenames goes here
numpy_array = numpy.ones([len(file_paths), file_size])
pool = gevent.pool.Pool(poolsize)
for i, list_file_path_tuples in enumerate(chunks(file_paths, CHUNK_SIZE)):
    # pool.map blocks until every file in the chunk has been loaded
    gevent_results = pool.map(numpy.load, list_file_path_tuples)
    pool.join()
    for i_chunk, result in enumerate(gevent_results):
        # row of this file in the full array
        index = i * CHUNK_SIZE + i_chunk
        data = result['arr_0']
        numpy_array[index] = data

      

Using chunks is necessary because otherwise I would have all the vectors twice in memory.

Is there a problem in my code or am I using the wrong approach?

+3




1 answer


Have you profiled your code, and do you know where the hotspot is? If it isn't computation, it's probably just disk I/O. I doubt you will get a performance gain from tricks in the I/O logic. In the end it comes down to sequential disk access, which may be the limit. If you have a RAID system, it makes sense to have multiple threads reading from disk, but you can do that with standard Python threads. Try going from one thread to several and measure along the way to find the sweet spot.
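The "measure along the way" step could be sketched roughly as follows, assuming Python 3 and the standard-library thread pool. The names `read_bytes`, `time_loading` and `find_sweet_spot` are made up for illustration, and `read_bytes` is only a stand-in loader: for the files in the question you would pass `numpy.load` as `loader` instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_bytes(path):
    # Stand-in loader (hypothetical); swap in numpy.load for the real files.
    with open(path, "rb") as f:
        return f.read()


def time_loading(file_paths, n_threads, loader=read_bytes):
    """Load every file using n_threads workers; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        # executor.map keeps input order: results[i] belongs to file_paths[i].
        results = list(executor.map(loader, file_paths))
    return time.perf_counter() - start, results


def find_sweet_spot(file_paths, thread_counts=(1, 2, 4, 8, 16), loader=read_bytes):
    """Time each thread count and return a dict {n_threads: seconds}."""
    timings = {}
    for n in thread_counts:
        elapsed, _ = time_loading(file_paths, n, loader)
        timings[n] = elapsed
    return timings
```

Calling `find_sweet_spot(file_paths)` on a representative subset of the files gives one timing per thread count. Keep in mind that the operating system caches recently read files, so later runs over the same files can look faster than the disk really is; timing a different subset per thread count is one way to reduce that effect.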



The reason you saw an improvement when loading images concurrently with gevent is that network I/O throughput can be greatly improved by using multiple connections. A single network connection can hardly saturate the network's bandwidth unless the remote server is directly connected to your network device, whereas a single I/O operation on a single disk can easily saturate the disk's bandwidth.

+4

