Numpy: translation of elements increases the file size by a lot (factor 8)
I have a 3D array that only contains the values 0, 1 and 2, and I want to translate those values to 0, 128 and 255 respectively. I looked around and this thread ( Translate every item in numpy array according to key ) looks like the way to go.
So I tried to implement it and it worked; the relevant part of the code is below (I read and write the data from/to h5 files — I doubt that matters, but I mention it just in case it does).
#fetch dataset from disk
f = h5py.File('input/A.h5','r') #size = 572kB
#read and transform array
array = f['data'].value #type = numpy.ndarray
my_dict = {1:128, 2:255, 0:0}
array=np.vectorize(my_dict.get)(array)
#write translated dataset to disk
h5 = h5py.File('output/B.h5', driver=None) #final size = 4.5MB
h5.create_dataset('data', data=array)
h5.close()
The problem is that while the input file (A.h5) is 572kB, the output file (B.h5) is 8 times larger (4.5MB).
What's going on here? I have another array with the same dimensions, which holds values from 0 to 255 and is also 572kB in size, so large numbers alone shouldn't matter. My first guess was that maybe Python was creating objects instead of ints; I tried casting to int, but the size stayed the same.
Side note: if I transform the data with 3 nested for loops instead, the size remains 572kB (but the code is much slower).
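For reference, the nested-loop version mentioned in the side note might look like this (a sketch; the small array and mapping here are illustrative stand-ins for the real data):

```python
import numpy as np

# Illustrative stand-ins for the real data and mapping.
my_dict = {0: 0, 1: 128, 2: 255}
array = np.array([[[0, 1], [2, 0]]], dtype=np.uint8)

# np.empty_like preserves the dtype (uint8), which is why this
# approach keeps the file size at 572kB.
out = np.empty_like(array)
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        for k in range(array.shape[2]):
            out[i, j, k] = my_dict[array[i, j, k]]

print(out.dtype)  # uint8 -- the dtype never widens
```

The dtype is preserved here because the output array is allocated up front with the input's dtype, unlike `np.vectorize`, which infers a new (wider) dtype.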
You probably get a factor of 8 because your array is written as int64 while the original array is stored as uint8. You may try:
array=np.vectorize(my_dict.get)(array).astype(np.uint8)
and then store it down to h5 ...
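A quick check of that claim (a sketch; the exact inferred dtype depends on the platform, but it is always wider than uint8 because the dict lookup returns Python ints):

```python
import numpy as np

my_dict = {0: 0, 1: 128, 2: 255}
a = np.random.randint(0, 3, (10, 10, 10)).astype(np.uint8)

# vectorize infers the output dtype from the Python ints my_dict.get returns,
# typically int64 on 64-bit platforms -- 8 bytes per element instead of 1.
mapped = np.vectorize(my_dict.get)(a)
print(mapped.dtype, mapped.nbytes // a.nbytes)

# Casting back to uint8 restores the original memory footprint.
fixed = mapped.astype(np.uint8)
print(fixed.nbytes == a.nbytes)  # True
```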
As @Jaime points out, you can save a copy of the array by telling vectorize
directly what datatype you want:
array=np.vectorize(my_dict.get, otypes=[np.uint8])(array)
While the linked SO accepted answer uses np.vectorize
, this is not the fastest choice, especially when you are just replacing 3 small numbers, 0, 1 and 2.
The new answer in this SO question provides a simple and fast indexing alternative:
fooobar.com/questions/2406846 / ...
In [508]: x=np.random.randint(0,3,(100,100,100))
In [509]: x.size
Out[509]: 1000000
In [510]: x1=np.vectorize(my_dict.get, otypes=['uint8'])(x)
In [511]: arr=np.array([0,128,255],np.uint8)
In [512]: x2=arr[x]
In [513]: np.allclose(x1,x2)
Out[513]: True
Compare their timings:
In [514]: timeit x1=np.vectorize(my_dict.get, otypes=['uint8'])(x)
10 loops, best of 3: 161 ms per loop
In [515]: timeit x2=arr[x]
100 loops, best of 3: 3.48 ms per loop
The indexing approach is much faster.
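Put together as a standalone sketch, the lookup-table version is just:

```python
import numpy as np

# Lookup-table translation: index a small table by the values themselves.
# This works because the data only contains 0, 1 and 2, which are valid indices.
x = np.random.randint(0, 3, (100, 100, 100))
lut = np.array([0, 128, 255], np.uint8)  # lut[0]=0, lut[1]=128, lut[2]=255

x2 = lut[x]
print(x2.dtype)  # uint8 -- the result inherits the table's dtype

# Sanity check against the dict-based mapping.
my_dict = {0: 0, 1: 128, 2: 255}
x1 = np.vectorize(my_dict.get, otypes=['uint8'])(x)
assert np.array_equal(x1, x2)
```

Note that the output dtype comes for free: whatever dtype the lookup table has, the result has too, so there is no separate cast step.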
There are a few things about np.vectorize
that users often miss.
-
There is a disclaimer about speed in its docs; it does not promise much speed over explicit iteration. What it does offer is convenient iteration over multidimensional arrays.
-
Without otypes
, it determines the dtype of the returned array from a trial calculation on the first element. That default sometimes causes problems. Specifying otypes
is just a convenience that gives you the correct dtype right away.
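One concrete way the trial-calculation default bites (an aside, not from the answer above): with an empty input there is no element to infer from, so vectorize raises, while otypes makes it work:

```python
import numpy as np

my_dict = {0: 0, 1: 128, 2: 255}
empty = np.array([], dtype=np.uint8)

# Without otypes there is nothing to run the trial calculation on,
# so NumPy raises a ValueError.
try:
    np.vectorize(my_dict.get)(empty)
    inferred_ok = True
except ValueError:
    inferred_ok = False
print(inferred_ok)  # False

# With otypes no trial call is needed, so empty input is fine.
out = np.vectorize(my_dict.get, otypes=['uint8'])(empty)
print(out.dtype, out.size)  # uint8 0
```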
As a curiosity, here is the timing for the equivalent list comprehension:
In [518]: timeit x3=np.array([my_dict[i] for i in x.ravel()]).reshape(x.shape)
1 loop, best of 3: 556 ms per loop
h5py
allows you to specify a dtype
when saving a dataset. Note the type
as I store the arrays in different ways.
In [529]: h5.create_dataset('data1',data=x1, dtype=np.uint8)
Out[529]: <HDF5 dataset "data1": shape (100, 100, 100), type "|u1">
In [530]: h5.create_dataset('data2',data=x1, dtype=np.uint16)
Out[530]: <HDF5 dataset "data2": shape (100, 100, 100), type "<u2">
In [531]: h5.create_dataset('data3',data=x1)
Out[531]: <HDF5 dataset "data3": shape (100, 100, 100), type "|u1">
In [532]: x.dtype
Out[532]: dtype('int32')
In [533]: h5.create_dataset('data4',data=x)
Out[533]: <HDF5 dataset "data4": shape (100, 100, 100), type "<i4">
In [534]: h5.create_dataset('data5',data=x, dtype=np.uint8)
Out[534]: <HDF5 dataset "data5": shape (100, 100, 100), type "|u1">
So even if you didn't specify uint8
in vectorize
, you could still store the array with that type.
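As a self-contained sketch of that last point (using h5py's in-memory core driver so nothing is left on disk; the dataset names are arbitrary):

```python
import numpy as np
import h5py

x = np.random.randint(0, 3, (10, 10, 10))  # platform default int dtype

# driver='core' with backing_store=False keeps the HDF5 file in memory only.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    d1 = f.create_dataset('as_is', data=x)                     # keeps x's dtype
    d2 = f.create_dataset('as_uint8', data=x, dtype=np.uint8)  # cast on write
    dt1, dt2 = d1.dtype, d2.dtype
    back = d2[...]  # reading it back yields uint8
    print(dt1, dt2, back.dtype)
```

So the cast can happen at write time, even if the in-memory array is still a wider integer type.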