Pandas and h5py load the same data (ndarray) differently

Question

I have an HDF5 file. It was built with the HDF5 C++ API using the following structures:

struct SignalDefH5
{
    char  id   [128];
    char  name [ 64];
    char  units[ 16];
    float min;
    float max;
    hvl_t tags; /* This right there does not work in Pandas... */
};

struct TagDefH5
{
    char  tag [ 64];
    char  desc[256];
};

If I load the file using h5py, I get this:

>>> import h5py
>>> hfile = h5py.File('test.h5', 'r')
>>> signals = hfile['/signals']
>>> signals[0]
('id1', 'a pressure', 'bar', 0.0, 300.0, ['Pressure'])
>>> type(signals[0][5])
numpy.ndarray

However, if I use Pandas to load the same file, I get this:

>>> import pandas as pd
>>> store = pd.HDFStore('test.h5')
>>> store.root.signals
/signals (Table(179,)) ''
  description := {
  "id": StringCol(itemsize=128, shape=(), dflt='', pos=0),
  "name": StringCol(itemsize=64, shape=(), dflt='', pos=1),
  "units": StringCol(itemsize=16, shape=(), dflt='', pos=2),
  "min": Float32Col(shape=(), dflt=0.0, pos=3),
  "max": Float32Col(shape=(), dflt=0.0, pos=4),
  "tags": StringCol(itemsize=64, shape=(), dflt='', pos=5)}
  byteorder := 'little'
  chunkshape := (234,)
>>> store.root.signals[0]
('id1', 'a pressure', 'bar', 0.0, 300.0, '\x02\x00\x00\x00\x00\x00\x00\x00\xf0f\x1e\x04\x00\x00\x00\x00\xba\nVT\xd1!\xa7\xdd\xb0\xe3\x9a\x02\x00\x00\x00\x00@\xecR\x1f\xa2\x7f\x00\x00}B\x178\x96\xa4u\xe6\xb0\xdd\x7f\x02\x00\x00\x00\x00 \x01')
>>> type(store.root.signals[0][5])
numpy.string_

There is obviously a problem with Pandas: what did I do wrong?

Python version is 2.7.5.
h5py version is 2.4.0.
Pandas version is 0.16.0.
PyTables version is 3.1.1.

python scipy pandas hdf5

Sardathrion May 01 '15 at 12:26

1 answer

Jeff · Accepted Answer · 2015-05-01T14:13:24+0000

Pandas HDF5 support uses PyTables. Pandas adds a layer of metadata on top of PyTables, which itself sits on top of raw HDF5. h5py, by contrast, is a fairly raw interface to HDF5.

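So pandas, for example, has no way of knowing what the sub-field actually is. The variable-length tags field is exposed as a fixed-size string column (the StringCol in your output), and you get back a raw-bytes string.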

Nested structures like these are simply not supported; they do not map well onto pandas structures. Furthermore, by creating this file in raw HDF5, you are missing a lot of the metadata that pandas needs to interpret the data.
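If the file has to stay as it is, one possible workaround is to read it with h5py, which does understand the variable-length field, and flatten each record before handing it to pandas. The following is only a sketch under assumptions: it targets Python 3, it assumes h5py returns the fixed-length char fields as bytes, and the helper names to_plain, flatten_record and load_signals are made up for illustration. Only the dataset path /signals comes from the question.

```python
def to_plain(value):
    """Recursively turn bytes into str and array-like values into lists."""
    if isinstance(value, bytes):
        # Fixed-length C char arrays may carry trailing NUL padding.
        return value.decode("utf-8", "replace").rstrip("\x00")
    if isinstance(value, str):
        return value
    try:
        items = list(value)
    except TypeError:
        # Not iterable (for example a float): keep it unchanged.
        return value
    return [to_plain(item) for item in items]


def flatten_record(record, field_names):
    """Map one compound record to a dict of plain Python values."""
    return {name: to_plain(value) for name, value in zip(field_names, record)}


def load_signals(path, dataset="/signals"):
    """Read the compound dataset with h5py and build a DataFrame from it."""
    # Imported here so the helpers above work without h5py or pandas.
    import h5py
    import pandas as pd

    with h5py.File(path, "r") as hfile:
        signals = hfile[dataset]
        names = signals.dtype.names
        rows = [flatten_record(rec, names) for rec in signals[:]]
    return pd.DataFrame(rows, columns=list(names))
```

Calling load_signals('test.h5') would then give one row per signal, with the tags column holding ordinary Python lists. Such object columns are fine in memory, but they still do not map onto a pandas HDF5 table as they are.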

Just use PyTables/pandas to write your data. You can then reverse engineer that format to use it from C++.
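As a rough illustration of that suggestion, the sketch below writes the signals as a flat pandas table. It assumes the tags can be stored as one delimited string per row, which is a simplification of the original nested layout and not something the question confirms. The helper names to_flat_rows and write_signals, the ';' separator and the 'signals' key are chosen here for illustration, and PyTables must be installed for to_hdf to work.

```python
def to_flat_rows(records, sep=";"):
    """Join each record's list of tags into one delimited string so that
    every column holds a scalar value."""
    flat = []
    for rec in records:
        row = dict(rec)  # copy, so the caller's record is not modified
        row["tags"] = sep.join(rec.get("tags", []))
        flat.append(row)
    return flat


def write_signals(records, path, key="signals"):
    """Write the flattened records with pandas in PyTables 'table' format."""
    # Imported here so to_flat_rows stays usable without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(to_flat_rows(records))
    frame.to_hdf(path, key=key, format="table")
    return frame
```

A file written this way can be read back with pd.read_hdf(path, key), and its on-disk layout can be inspected with h5py or h5dump to see what a C++ writer would have to reproduce.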
