Referencing an HDF dataset in h5py using astype

From the h5py docs, I see that I can read an HDF dataset as a different type using the astype method for datasets. This returns a context manager that performs the conversion on the fly.

However, I would like to read in a dataset stored as uint16 and then cast it to float32. After that, I would like to extract various slices from this dataset in another function, as the cast type float32. The docs explain the usage as:

with dataset.astype('float32'):
   castdata = dataset[:]

      

This reads the entire dataset and converts it to float32, which I don't want. I would like to have a reference to the dataset, but cast as float32, equivalent to numpy.astype. How do I create a reference to the .astype('float32') object so that I can pass it to another function for use?

Example:

import h5py as HDF
import numpy as np
intdata = (100*np.random.random(10)).astype('uint16')

# create the HDF dataset
def get_dataset_as_float():
    hf = HDF.File('data.h5', 'w')
    d = hf.create_dataset('data', data=intdata)
    print(d.dtype)
    # uint16

    with d.astype('float32'):
        # This won't work since the context expires. Returns a uint16 dataset reference
        return d

    # this works but causes the entire dataset to be read & converted
    # with d.astype('float32'):
    #   return d[:]

      

Also, it looks like the astype context only applies when data elements are accessed. This means that:

def use_data():
    d = get_dataset_as_float()
    # this is a uint16 dataset

    # try to use it as a float32
    with d.astype('float32'):
        print(np.max(d))      # --> output is uint16
        print(np.max(d[:]))   # --> output is float32, but entire data is loaded

      

So, is there no numpy-esque way of using astype?



2 answers


d.astype() returns an AstypeContext object. If you look at the source for AstypeContext, you get a better idea of what's going on:

class AstypeContext(object):

    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = numpy.dtype(dtype)

    def __enter__(self):
        self._dset._local.astype = self._dtype

    def __exit__(self, *args):
        self._dset._local.astype = None

      

When you enter the AstypeContext, the ._local.astype attribute of your dataset is set to the new desired type, and when you exit the context, it is reset to None, so reads go back to the stored type.
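To make the set-on-enter, reset-on-exit behavior concrete, here is a minimal toy sketch. ToyDataset and ToyAstypeContext are hypothetical stand-ins written for illustration only (they are not h5py classes), and a plain Python list plays the role of the stored data. It shows why returning the dataset from inside the with block hands back an object that no longer casts:

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset whose reads honor a cast override."""

    def __init__(self, values):
        self._values = list(values)
        self._astype = None  # plays the role of ._local.astype

    def astype(self, caster):
        return ToyAstypeContext(self, caster)

    def __getitem__(self, key):
        out = self._values[key]
        if self._astype is None:
            return out
        if isinstance(out, list):
            return [self._astype(v) for v in out]
        return self._astype(out)


class ToyAstypeContext:
    """Sets the override on entry and clears it on exit."""

    def __init__(self, dset, caster):
        self._dset = dset
        self._caster = caster

    def __enter__(self):
        self._dset._astype = self._caster

    def __exit__(self, *args):
        self._dset._astype = None


def leak_from_context(dset):
    with dset.astype(float):
        # __exit__ still runs when we return, so the override is cleared
        return dset


toy = ToyDataset([1, 2, 3])
with toy.astype(float):
    inside = toy[:]                  # read while the override is active
outside = leak_from_context(toy)[:]  # read after the context has exited

print(inside)   # [1.0, 2.0, 3.0]
print(outside)  # [1, 2, 3]
```

The cast is a property of the moment the read happens, not of the object you hold, which is why the reference returned from the question's function behaves like the original uint16 dataset.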

By setting that attribute yourself, you can get more or less the behavior you are looking for, for example:



def get_dataset_as_type(d, dtype='float32'):

    # creates a new Dataset instance that points to the same HDF5 identifier
    d_new = HDF.Dataset(d.id)

    # set the ._local.astype attribute to the desired output type
    d_new._local.astype = np.dtype(dtype)

    return d_new

      

When you now read from d_new, you get float32 numpy arrays rather than uint16:

d = hf.create_dataset('data', data=intdata)
d_new = get_dataset_as_type(d, dtype='float32')

print(d[:])
# array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype=uint16)
print(d_new[:])
# array([ 81.,  65.,  33.,  22.,  67.,  57.,  94.,  63.,  89.,  68.], dtype=float32)

print(d.dtype, d_new.dtype)
# uint16, uint16

      

Note that this does not update the .dtype attribute of d_new (which appears to be immutable). If you also wanted to change the dtype attribute, you would probably need to subclass h5py.Dataset.
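As an alternative to subclassing h5py.Dataset, a thin wrapper that does not touch any private attributes might be enough. The following is only a sketch: CastView is a hypothetical name, and it assumes nothing about the wrapped object beyond slicing support and a .shape attribute, so a plain numpy array stands in for an open dataset here.

```python
import numpy as np


class CastView:
    """Hypothetical lazy view: casts only the elements actually requested."""

    def __init__(self, dset, dtype):
        self._dset = dset              # any sliceable, array-like object
        self.dtype = np.dtype(dtype)   # the reported dtype is the target type

    @property
    def shape(self):
        return self._dset.shape

    def __len__(self):
        return len(self._dset)

    def __getitem__(self, key):
        # Only the selected elements are read, then cast in memory.
        return np.asarray(self._dset[key]).astype(self.dtype)


# A plain array stands in for an open h5py dataset in this sketch.
stored = np.array([81, 65, 33, 22, 67], dtype="uint16")
view = CastView(stored, "float32")

part = view[1:3]
print(part, part.dtype, view.dtype)
```

The view can be passed to other functions like any other object, and each of them reads and converts only the slice it asks for. Unlike the d_new approach, the cast happens in numpy after the read rather than inside HDF5, and the underlying file still has to stay open for as long as the view is in use.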



The docs for astype seem to imply that reading everything into a new location is its purpose. So your return d[:] is the most sensible option if you want to reuse the float-cast data with many functions on separate occasions.

If you know what you need the cast data for and only need it once, you can switch things around and do something like:

def get_dataset_as_float(intdata, *funcs):
    with HDF.File('data.h5', 'w') as hf:
        d = hf.create_dataset('data', data=intdata)
        with d.astype('float32'):
            d2 = d[...]
            return tuple(f(d2) for f in funcs)

      

In any case, you have to make sure that hf is closed before exiting the function, otherwise you will run into problems later.



In general, I would suggest separating the casting entirely from loading/creating the dataset, and passing the dataset in as one of the function's parameters.

The above can be called like this:

In [16]: get_dataset_as_float(intdata, np.min, np.max, np.mean)
Out[16]: (9.0, 87.0, 42.299999)
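If the concern is that a reduction such as np.max(d[:]) pulls the whole dataset into memory, one option is to apply it block by block and cast each block as it is read. This is a rough sketch under assumptions: chunked_max and its chunk_rows parameter are hypothetical names, the data is chunked along the first axis only, and a numpy array again stands in for the dataset that would be passed in.

```python
import numpy as np


def chunked_max(dset, dtype="float32", chunk_rows=4):
    """Maximum over all elements, reading chunk_rows rows at a time."""
    best = None
    for start in range(0, dset.shape[0], chunk_rows):
        # Read one block, then cast just that block in memory.
        block = np.asarray(dset[start:start + chunk_rows]).astype(dtype)
        if block.size == 0:
            continue
        block_max = block.max()
        best = block_max if best is None else max(best, block_max)
    return best  # None if the dataset is empty


# A plain array stands in for the dataset passed in as a parameter.
stored = np.array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype="uint16")
result = chunked_max(stored)
print(result, result.dtype)
```

Only chunk_rows rows are held as float32 at any time. The same pattern works for other reductions that can be combined across blocks, such as min or sum, but not directly for ones like the median.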

      
