np.load() is unusually slow for large files

So, I am loading some files with np.load(), ranging in size from 150 to 250 MB. Each file contains an array holding 5 sub-arrays of data. Some files load in under a second, while others take up to 5 seconds, and since I have a lot of these files, working with them takes quite a long time because of the slow load times. However, I noticed that if I split each file into 5 smaller files (one sub-array per file), the load time is always well under a second for all 5 files combined.

What could be the reason for this? How can I speed up np.load() without splitting each file into smaller files?


1 answer


The root of the problem is that there is actually no concept of subarrays in numpy.

Consider the following example:

import numpy as np

a1 = np.ones(2**17)
a2 = np.arange(2**18)
a3 = np.random.randn(2**19)

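# dtype=object must be given explicitly on recent NumPy versions,
# which refuse to build ragged arrays implicitly.
a = np.array([a1, a2, a3], dtype=object)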

print(a.dtype)  # object

      

When you put arrays of different lengths into a numpy array, numpy does not know that the elements are arrays. Instead, it treats them as generic Python objects. This is what the documentation of np.save says:

allow_pickle : bool, optional

Allow saving arrays of objects using Python pickles. [...] Default: True

So what happens is that the sub-arrays are handled by the pickler, which is super inefficient. Apparently this does not happen when you save the arrays individually, because then they are stored efficiently as native numpy arrays. Unfortunately, you cannot simply set allow_pickle=False, because then numpy won't let you store object arrays at all.
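
The behaviour described above can be checked with a short sketch. The file name and the small arrays are made up for illustration; the only assumption is a NumPy version in which np.load defaults to allow_pickle=False.

```python
import os
import tempfile

import numpy as np

# Three arrays of different lengths, so they can only be combined as dtype=object.
parts = [np.ones(4), np.arange(8), np.zeros(16)]
a = np.empty(len(parts), dtype=object)
for i, part in enumerate(parts):
    a[i] = part

# Hypothetical file location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "objects.npy")

# Saving without pickle support is refused for object arrays.
try:
    np.save(path, a, allow_pickle=False)
    save_refused = False
except ValueError:
    save_refused = True

# Saving with pickling works, but loading must opt in to pickles as well.
np.save(path, a, allow_pickle=True)
try:
    np.load(path, allow_pickle=False)
    load_refused = False
except ValueError:
    load_refused = True

restored = np.load(path, allow_pickle=True)
print(save_refused, load_refused, restored.dtype)
```

Note that allow_pickle=True should only be used on files you trust, since unpickling can execute arbitrary code.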



The solution is to use np.savez to store multiple arrays in one file. Below is a timing comparison using the arrays from above:

np.save('test.npy', a)
%timeit np.load('test.npy', allow_pickle=True)  # 10 loops, best of 3: 40.4 ms per loop

np.savez('test2.npz', a1, a2, a3)
%timeit np.load('test2.npz')  # 1000 loops, best of 3: 339 µs per loop
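
One caveat when reproducing this comparison: np.load on a .npz file returns a lazy archive object, and each array is only read from disk when it is indexed. The sketch below therefore times a full read of every array in both formats. The helper best_of and the temporary file paths are assumptions made for illustration, and the numbers will depend on the NumPy version and hardware.

```python
import os
import tempfile
import time

import numpy as np

a1 = np.ones(2**17)
a2 = np.arange(2**18)
a3 = np.random.randn(2**19)

# Object array of sub-arrays, built explicitly so it also works on recent NumPy.
a = np.empty(3, dtype=object)
a[0], a[1], a[2] = a1, a2, a3

tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "test.npy")
npz_path = os.path.join(tmpdir, "test2.npz")
np.save(npy_path, a)
np.savez(npz_path, a1, a2, a3)


def load_pickled():
    # Object arrays are unpickled in full as soon as np.load returns.
    return list(np.load(npy_path, allow_pickle=True))


def load_npz():
    # Index every member so the data is actually read from disk.
    with np.load(npz_path) as archive:
        return [archive[name] for name in archive.files]


def best_of(func, repeats=5):
    # Hypothetical helper: smallest wall-clock time over a few runs.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


print(f"pickled .npy: {best_of(load_pickled) * 1e3:.2f} ms")
print(f".npz, all arrays read: {best_of(load_npz) * 1e3:.2f} ms")
```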

      

You can retrieve the arrays with

x = np.load('test2.npz')
a1 = x['arr_0']
a2 = x['arr_1']
# ...

      

It may be nicer to pass the arrays to savez as keyword arguments, which lets you give them names:

np.savez('test3.npz', a1=a1, a2=a2, timmy=a3)
x = np.load('test3.npz')
a1 = x['a1']
a2 = x['a2']
a3 = x['timmy']
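
If the large object-array files already exist, they could be rewritten once in the .npz layout instead of being split by hand. The following is only a sketch: convert_to_npz, the file names, and the member names are hypothetical, and it assumes the old files come from a trusted source, since reading them requires allow_pickle=True.

```python
import os
import tempfile

import numpy as np


def convert_to_npz(npy_path, npz_path, names=None):
    """Rewrite an object-array .npy file as a .npz archive (hypothetical helper)."""
    # The old file was pickled, so loading it has to opt in explicitly.
    parts = np.load(npy_path, allow_pickle=True)
    if names is None:
        names = [f"arr_{i}" for i in range(len(parts))]
    # Each sub-array becomes its own named, natively stored member.
    members = {name: np.asarray(part) for name, part in zip(names, parts)}
    np.savez(npz_path, **members)


# Small demonstration with made-up data.
tmpdir = tempfile.mkdtemp()
old_path = os.path.join(tmpdir, "old.npy")
new_path = os.path.join(tmpdir, "new.npz")

old = np.empty(2, dtype=object)
old[0], old[1] = np.arange(5), np.linspace(0.0, 1.0, 3)
np.save(old_path, old)

convert_to_npz(old_path, new_path, names=["main", "aux"])
with np.load(new_path) as archive:
    names_found = sorted(archive.files)
    main = archive["main"]
print(names_found, main)
```

After the conversion, only the .npz files need to be loaded, and no pickling is involved.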

      
