How to copy a multidimensional h5py dataset into a flat 1D Python list without any intermediate copies?

Question

How can I copy data from an N x N x N x ... h5py dataset into a flat 1D standard Python list without an intermediate copy of the data?

I can think of several different ways to do this with an intermediate copy. For example:

import h5py
import numpy as np

# initialize list, put some initial data in it
myList = ['foo']

# open up an h5py dataset from a file on disk
myFile = h5py.File('/path-to-my-data', 'r')
myData = myFile['bar']
myData.shape        # returns, for example, (5,15,7)

# copy dataset over to a numpy array
arr = np.zeros(myData.shape)
myData.read_direct(arr)

# finally, add data from copied dataset to myList
myList.extend(arr.flatten())


Can this be done without an intermediate copy to a numpy array?
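For reference, the intermediate-copy approach above can be written more compactly; the goal is to avoid the temporary array that the flatten step creates. A minimal sketch of that baseline, using a small NumPy array as a stand-in for the dataset (slicing a real h5py dataset with `[...]` returns a NumPy array, so the same line applies):

```python
import numpy as np

# Stand-in for an h5py dataset; with a real dataset you would use myData[...]
data = np.arange(24).reshape(2, 3, 4)

# One intermediate copy: flatten() allocates a new 1D array,
# then tolist() converts it to a standard Python list.
flat_list = data.flatten().tolist()
```

This is the one-copy route the question is trying to avoid.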

Some background

(you absolutely don't need to read this unless you're curious)

I am trying to copy data from an HDF5 file into a protocol buffers log file via their Python APIs. Both are libraries and formats for serializing complex data structures. In their Python APIs, HDF5 pretends that its arrays are NumPy arrays, whereas protocol buffers pretend that their arrays are standard 1D Python lists (unfortunately, protocol buffers have no built-in support for simple multidimensional arrays). So I need to convert from an h5py dataset to a Python list.

Edit

There was a request for some clarification on what I meant by

HDF5 pretends that its arrays are numpy arrays, whereas protocol buffers pretend that its arrays are standard 1D Python lists

I mean that an h5py dataset exposes an interface to the user similar to that of a NumPy array, and that a Python Protobuf repeated numeric field exposes an interface similar to that of a standard Python list. However, neither implements the full behavior, or even the complete interface, of its prototype. For example, h5py datasets don't have a .flatten() method, and Protobuf repeated fields complain if you try to assign lists as elements (for example, myBuf.repIntField[2] = [1,2,3]

always throws an error).

Here's the relevant line from the Pybuf documentation :

Repeated fields are represented as an object that acts like a Python sequence.

And the relevant lines from the h5py documentation (emphasis mine):

Datasets are very similar to NumPy arrays. They are homogeneous collections of data elements with an immutable datatype and a (hyper)rectangular shape. Unlike NumPy arrays, they support a variety of transparent storage features such as compression, error detection, and chunked I/O.


1 answer


For NumPy arrays, I would suggest using ndarray.flat, but h5py Datasets don't have a flat/flatten attribute.

You can create a generator that reads chunks into memory as NumPy arrays and then yields values from each flattened chunk. The generator can then be converted to a list. For example, chunking over just the outer dimension:



import itertools

def yield_chunks(x):
    # Iterate over the outer dimension; each chunk is read into
    # memory as a NumPy array and yielded as a flat iterator.
    for chunk in iter(x):
        yield chunk.flat

# chain.from_iterable stitches the per-chunk iterators into one stream
myGenerator = itertools.chain.from_iterable(yield_chunks(myData))


myGenerator will yield individual values from myData. You can convert it to a list with list(myGenerator).
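Putting the answer together, here is a runnable sketch. A NumPy array stands in for the h5py dataset, since iterating over either one yields its outer slices the same way:

```python
import itertools

import numpy as np

def yield_chunks(x):
    # Each chunk is one slice along the outer dimension,
    # yielded as a flat iterator over its values.
    for chunk in iter(x):
        yield chunk.flat

# Stand-in for myData; a real h5py dataset iterates identically,
# reading only one outer slice into memory at a time.
data = np.arange(12).reshape(3, 2, 2)

# Stitch the per-chunk iterators into one stream of scalars.
myGenerator = itertools.chain.from_iterable(yield_chunks(data))
myList = list(myGenerator)
```

With a real dataset, only one outer slice is resident in memory at a time, so the full array is never materialized as a single intermediate NumPy copy.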







