How to share memory between an HDF5 dataset and a NumPy ndarray

I am writing an application to stream data from a sensor and then process the data in various ways. The processing components include data visualization, some number crunching (linear algebra), and writing the data to disk in HDF5 format. Ideally, each of these components would be its own module, all running in the same Python process, so IPC is not a problem. This brings me to the question of how to store the streaming data efficiently.

The datasets are quite large (~5 GB), so I would like to minimize the number of copies of the data in memory by sharing it between the components that need access to it. If all the components used plain ndarrays, this would be straightforward: give one component the data, then give everyone else a view created with ndarray.view().

However, the component that writes the data to disk stores it in an HDF5 Dataset. Datasets interoperate with ndarrays in a variety of ways, but creating a view() does not seem to work the way it does with ndarrays.

Observe with ndarrays:

>>> import numpy as np
>>> source = np.zeros((10,))
>>> view = source.view()
>>> source[0] = 1
>>> view[0] == 1
True
>>> view.base is source
True


However, this doesn't work with HDF5 Datasets:

>>> import numpy as np
>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source_dset = file.create_dataset('source', (10,), dtype=np.float64)
>>> view_dset = source_dset.value.view()
>>> source_dset[0] = 1
>>> view_dset[0] == 1
False
>>> view_dset.base is source_dset.value
False


Simply assigning Dataset.value instead of taking a view also fails:

>>> view_dset = source_dset.value
>>> source_dset[0] = 2
>>> view_dset[0] == 2
False
>>> view_dset.base is source_dset.value
False


So my question is this: is there a way for an ndarray to share memory with an HDF5 Dataset, the same way two ndarrays can share memory?

I suspect this is unlikely to work, possibly due to some subtlety in how HDF5 stores arrays in memory. But it confuses me a little, especially since type(source_dset.value) == numpy.ndarray and yet the OWNDATA flag of Dataset.value.view() really is False. Who owns the memory that the view is interpreting?
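
For reference, this is roughly how I checked the flags (same source_dset as above):

>>> tmp = source_dset.value.view()
>>> type(tmp)
<class 'numpy.ndarray'>
>>> tmp.flags['OWNDATA']
False
>>> tmp.base is source_dset
False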

Version details: Python 3, NumPy version 1.9.1, h5py version 2.3.1, HDF5 version 1.8.13, Linux.

Other information: the HDF5 file is chunked.

EDIT

After playing with this a little more, it seems one possible solution is to give the other components a reference to the HDF5 Dataset itself. This doesn't seem to copy any memory (at least not according to top), and changes to the source Dataset are reflected in the reference.

>>> import numpy as np
>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source = file.create_dataset('source', (10,), dtype=np.float64)
>>> class Container():
...     def __init__(self, source_dset):
...         self.dset = source_dset
...
>>> container = Container(source)
>>> source[0] = 1
>>> container.dset[0] == 1
True

      
I'm fairly happy with this solution (as long as it actually saves memory), but I'm still wondering why the view approach above doesn't work.


1 answer


The short answer is that you cannot share memory between a numpy array and an h5py dataset. While they have a similar API (at least when it comes to indexing), they don't have a compatible memory layout. In fact, apart from any cache, the dataset isn't even in memory: it lives in a file.
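
A minimal sketch of what that means in practice (the scratch file name here is arbitrary): every indexed read copies data out of the file into a fresh ndarray, so two reads never share a buffer.

import numpy as np
import h5py

f = h5py.File('demo.h5', 'w')
dset = f.create_dataset('d', (10,), dtype=np.float64)

a = dset[:]  # each read copies from the file into a new ndarray
b = dset[:]
print(a is b)                     # False: two distinct arrays
print(np.may_share_memory(a, b))  # False: no common buffer
f.close()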


First, I don't understand why you need to use source.view() with a numpy array. Yes, when selecting from or reshaping an array, numpy tries to return a view rather than a copy. But most (all?) of the examples of .view involve some kind of transformation, such as a change of dtype. Can you point to any sample code or documentation where a bare .view() is what helps?
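
For instance, a typical use of .view is reinterpreting an existing buffer under a different dtype, which is a transformation rather than plain sharing:

>>> x = np.zeros(4, dtype=np.float64)
>>> y = x.view(np.uint8)   # same 32 bytes, reinterpreted byte by byte
>>> y.shape
(32,)
>>> y.base is x
True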

I don't have much experience with h5py, but its documentation describes it as a thin, ndarray-like wrapper around h5 file objects. Your DataSet is not an ndarray. For example, it lacks many ndarray methods, including view.
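
You can verify this quickly against the dataset from the question (true at least for h5py 2.x):

>>> isinstance(source_dset, np.ndarray)
False
>>> hasattr(source_dset, 'view')
False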

But indexing a DataSet returns an ndarray, e.g. source_dset[:]. So does .value. The first part of its documentation (via source_dset.value?? in IPython):

Type:            property
String form:     <property object at 0xb4ee37d4>
Docstring:       Alias for dataset[()] 
...


Note that when you assign new values to the DataSet, you have to index source_dset directly. Indexing its value doesn't work; or rather, it only changes the retrieved array, not the underlying file object.
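
A short sketch of the difference, reusing source_dset from the question:

>>> source_dset[0] = 3.0          # indexing the DataSet writes through to the file
>>> source_dset.value[0]
3.0
>>> source_dset.value[0] = 99.0   # this only mutates a temporary copy
>>> source_dset.value[0]          # the file object is unchanged
3.0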



And creating a dataset from an array doesn't bind them together any more tightly:

x = np.arange(10)
xdset = file.create_dataset('x', data=x)
x1 = xdset[:]


x, xdset, and x1 are all independent; changing one does not change the others.

Compare some timings:

timeit np.sum(x)  #  11.7 µs
timeit np.sum(xdset) # 203 µs
timeit xdset.value #  173 µs
timeit np.sum(x1)  # same as for x


sum on the array is much faster than on the dataset. Most of the extra time comes from creating an array from the dataset.
