Referencing an HDF dataset in h5py using astype
From the h5py docs, I see that I can read an HDF dataset as a different type using the astype method for datasets. This returns a context manager that performs the conversion on the fly.
However, I would like to read from a dataset stored as uint16 and have it converted to float32. After that, I would like to extract various slices from this dataset in another function, already cast to float32. The docs explain the usage as
with dataset.astype('float32'):
    castdata = dataset[:]
This reads the entire dataset and converts it to float32, which I don't want. I would like to have a reference to the dataset that behaves as float32, the equivalent of numpy.astype. How do I create a reference to the .astype('float32') object so that I can pass it to another function for use?
Example:
import h5py as HDF
import numpy as np
intdata = (100*np.random.random(10)).astype('uint16')
# create the HDF dataset
def get_dataset_as_float():
    hf = HDF.File('data.h5', 'w')
    d = hf.create_dataset('data', data=intdata)
    print(d.dtype)
    # uint16
    with d.astype('float32'):
        # This won't work since the context expires. Returns a uint16 dataset reference
        return d
    # this works but causes the entire dataset to be read & converted
    # with d.astype('float32'):
    #     return d[:]
Also, it looks like the astype context only applies when the data elements are accessed. This means that
def use_data():
    d = get_dataset_as_float()
    # this is a uint16 dataset
    # try to use it as a float32
    with d.astype('float32'):
        print(np.max(d))     # --> output is uint16
        print(np.max(d[:]))  # --> output is float32, but entire data is loaded
So, is there no numpy-esque way of using astype?
d.astype() returns an AstypeContext object. If you look at the source for AstypeContext, you get a better idea of what's going on:
class AstypeContext(object):
    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = numpy.dtype(dtype)
    def __enter__(self):
        self._dset._local.astype = self._dtype
    def __exit__(self, *args):
        self._dset._local.astype = None
When you enter the AstypeContext, the ._local.astype attribute of your dataset is set to the new desired type, and when you exit the context it is reset (to None in the source above).
You can therefore get more or less the behaviour you are looking for like this, for example:
def get_dataset_as_type(d, dtype='float32'):
    # creates a new Dataset instance that points to the same HDF5 identifier
    d_new = HDF.Dataset(d.id)
    # set the ._local.astype attribute to the desired output type
    d_new._local.astype = np.dtype(dtype)
    return d_new
When you now read from d_new, you get float32 numpy arrays back rather than uint16:
d = hf.create_dataset('data', data=intdata)
d_new = get_dataset_as_type(d, dtype='float32')
print(d[:])
# array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype=uint16)
print(d_new[:])
# array([ 81., 65., 33., 22., 67., 57., 94., 63., 89., 68.], dtype=float32)
print(d.dtype, d_new.dtype)
# uint16, uint16
Note that this does not update the .dtype attribute of d_new (which appears to be immutable). If you also wanted to change the dtype attribute, you would probably need to subclass h5py.Dataset to do so.
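If you would rather not touch the private ._local attribute, a thin wrapper is another option. The following is only a minimal sketch: AsTypeView is a hypothetical name, not part of h5py, and it assumes the wrapped object supports slicing and has a .shape, as an h5py Dataset does. A NumPy array stands in for the open dataset here. Unlike astype, the cast is done in memory after the read rather than by HDF5, but only the requested selection is read.

```python
import numpy as np


class AsTypeView:
    """Hypothetical helper: a lazy, read-only view that casts on each read."""

    def __init__(self, dataset, dtype):
        self._dataset = dataset       # anything sliceable, e.g. an h5py Dataset
        self.dtype = np.dtype(dtype)  # the dtype that reads will return

    @property
    def shape(self):
        return self._dataset.shape

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, key):
        # Only the requested selection is read from the underlying object;
        # the cast to self.dtype then happens in memory.
        return np.asarray(self._dataset[key]).astype(self.dtype)


# Stand-in for an open uint16 dataset; a NumPy array slices the same way.
raw = np.arange(10, dtype='uint16')
view = AsTypeView(raw, 'float32')
chunk = view[2:5]
print(chunk.dtype, chunk)
```

The view object can be passed to another function and sliced there, and its .dtype reports the cast type. It is also worth checking the astype documentation for your h5py version: h5py 3 documents slicing the object returned by astype directly, as in dset.astype('int16')[:10], which may make a wrapper like this unnecessary.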
The docs for astype seem to imply that reading everything into a new location is its purpose. So your return d[:] is the most reasonable approach if you want to reuse the float cast with many functions on separate occasions.
If you know what you need the cast for and only need it once, you could switch things around and do something like:
def get_dataset_as_float(intdata, *funcs):
    with HDF.File('data.h5', 'w') as hf:
        d = hf.create_dataset('data', data=intdata)
        with d.astype('float32'):
            d2 = d[...]
            return tuple(f(d2) for f in funcs)
In any case, you need to make sure that hf is closed before leaving the function, or you will run into problems later on.
In general, I would suggest separating the casting from the loading/creating of the dataset entirely and passing the dataset in as one of the function's parameters.
The above can be called like this:
In [16]: get_dataset_as_float(intdata, np.min, np.max, np.mean)
Out[16]: (9.0, 87.0, 42.299999)
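The suggestion to separate casting from loading could look roughly like the sketch below. apply_as_type and its selection parameter are hypothetical names introduced for illustration, and a NumPy array stands in for an already-open dataset. The cast here is done in memory with NumPy after reading, which is an assumption of this sketch and not the astype context used above.

```python
import numpy as np


def apply_as_type(dataset, funcs, dtype='float32', selection=Ellipsis):
    """Hypothetical helper: read one selection, cast it, apply each function."""
    # Read only the requested selection, then cast the in-memory copy.
    data = np.asarray(dataset[selection]).astype(dtype)
    return tuple(f(data) for f in funcs)


# Stand-in for an already-open uint16 dataset.
intdata = np.array([9, 87, 31], dtype='uint16')
result = apply_as_type(intdata, (np.min, np.max))
print(result)

# Restricting the read to a slice of the data.
partial = apply_as_type(intdata, (np.max,), selection=slice(0, 2))
print(partial)
```

With this split, the caller opens the file in a with block and passes the dataset in, so the file is closed no matter what the function does with the data.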