Consistent segfault when trying to access an object field in an h5py Dataset

Are there any special restrictions on data access through h5py? From the dataset in question I expect an array whose elements are themselves arrays (variable-length data).

I can open the file and access most of the keys in it without a problem:

>>> import h5py
>>> fileobj = h5py.File('path/to/file.ext')
>>> test = fileobj['SomeKey']
>>> test.dtype
dtype('float64')
>>> test[:]
array([  3.50000460e+02,   1.23662217e-03,   1.23662872e-03, ...,
     9.94521356e-03,   9.94531916e-03,   9.94542476e-03])
>>> test.shape
(49682960,)
>>> # this loads fine
... problem = fileobj['SomeOtherKey']
>>> problem.shape
(13570,)
>>> # accessing specific keys works fine too
... sub_dset = problem['subkey1']
>>> sub_dset.dtype
dtype('O')
>>> # checking compression, nothing out of the ordinary...
>>> problem._filters
{'gzip': 1, 'shuffle': (144,)}
>>> problem.dtype['subkey2']
dtype('O')
>>> problem['subkey2']
Segmentation fault

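For what it's worth, h5py's low-level layer can show which HDF5 type is hiding behind that dtype('O'). This is only a diagnostic sketch using the standard h5t calls:

>>> from h5py import h5t
>>> tid = problem.id.get_type()          # low-level TypeID of the compound dataset
>>> for i in range(tid.get_nmembers()):
...     name = tid.get_member_name(i)    # member name as bytes
...     print(name, tid.get_member_class(i) == h5t.VLEN)   # True for vlen members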

The segfault also happens when I try slices, copying datasets, and so on:

>>> problem['subkey2'][:]
Segmentation fault

>>> import numpy as np
>>> arr = np.zeros(problem.shape)   # allocating the full result in memory works fine
>>> # this works fine
... ds = fileobj.create_dataset('ds', data=problem['subkey1'])
>>> # this also works fine
... ds1 = fileobj.create_dataset('ds1', problem.shape, problem.dtype, data=problem['subkey1'])
>>> # however.....
... ds2 = fileobj.create_dataset('ds2', problem.shape, problem.dtype, data=problem)
Segmentation fault
>>> ds3 = fileobj.create_dataset('ds3', data=problem['subkey2'])
Segmentation fault
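To narrow it down further, one could probe record by record; since each crash kills the interpreter, the last index printed identifies the first offending row (again, only a diagnostic sketch):

>>> for i in range(problem.shape[0]):
...     print(i)             # last index printed before a crash marks the culprit
...     row = problem[i]     # full-record read, includes the vlen member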


At first I thought it might be a memory issue. However, I can reproduce it with a standard test file from the (now deprecated, I suppose) pymz5 library:

https://github.com/jmchilton/pymz5/blob/master/test_data/test.mz5

With this file, the problem is reproduced as follows:

>>> fileobj = h5py.File('test.mz5')
>>> problem = fileobj['SpectrumMetaData']
>>> problem.shape
(26,)
>>> problem['precursors']
Segmentation Fault


I checked that the "id" field (among other keys) of the SpectrumMetaData dataset reads fine, and this file has no compression filters, so I assume the segmentation fault is caused by the data itself.
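Listing the compound members and their NumPy mappings reads no data, so it is safe even on this file; this sketch is essentially how I checked the individual fields:

>>> for name in problem.dtype.names:
...     print(name, problem.dtype[name])   # the dtype('O') entries are the vlen members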

In case this is Python- or h5py-version specific: I ran all of the tests above on h5py 2.5.0 with Python 2.7.9.

When I try it on h5py version 2.2.1 (Ubuntu, Python 3.4.3) I get:

TypeError: No NumPy equivalent for TypeVlenID exists


I know that variable-length dtype support has improved in h5py since version 2.3, so I also upgraded to h5py 2.5.0 under Python 3, and I get the same problems as before.
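For comparison, my understanding is that the supported case is a top-level variable-length type created with special_dtype, as opposed to a vlen field inside a compound type; a minimal sketch ('vlen_ok' is just a scratch name I made up):

>>> import numpy as np
>>> vlen_t = h5py.special_dtype(vlen=np.dtype('float64'))
>>> scratch = fileobj.create_dataset('vlen_ok', (2,), dtype=vlen_t)
>>> scratch[0] = np.arange(3.0)   # writing/reading a plain vlen like this is the documented path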

How can I access this data through h5py's higher-level API? I would rather not resort to writing a custom datatype in Cython if I can avoid it.
