Iterating through an HDF5 database is slow in Python

I created an HDF5 file with h5py containing 5000 groups, each holding 1000 datasets. The datasets are 1-D arrays of integers averaging about 10,000 elements, although the length varies from dataset to dataset. The datasets were created without the chunked storage option. The total size of the file is 281 GB. I want to iterate over all datasets to build a dictionary mapping each dataset name to its length. I will use this dictionary in an algorithm later.
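To put the scale in perspective, here is a quick back-of-the-envelope calculation using only the figures above. It assumes "GB" means 10**9 bytes and treats 10,000 as the exact average length, so the results are rough estimates rather than measurements.

```python
# Rough scale of the file, from the numbers quoted in the question.
n_groups = 5000
datasets_per_group = 1000
avg_elements = 10_000        # approximate average length per dataset
total_bytes = 281 * 10**9    # assumption: "GB" means 10**9 bytes

n_datasets = n_groups * datasets_per_group   # objects the loop must visit
n_elements = n_datasets * avg_elements       # integers stored overall
bytes_per_element = total_bytes / n_elements

print(f"datasets to visit:  {n_datasets:,}")
print(f"elements (approx.): {n_elements:,}")
print(f"bytes per element:  {bytes_per_element:.2f}")
```

So the loop has to open about five million dataset objects, and the resulting dictionary will have as many entries.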

Here's what I've tried.

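import h5py

lens = {}
# Open read-only; the context manager closes the file when done.
with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    for group in f.values():  # itervalues() on Python 2 / older h5py
        for dataset in group.values():
            # Key by the dataset's full path, e.g. '/group/dataset'.
            lens[dataset.name] = dataset.len()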


This is too slow for my purposes and I am looking for ways to speed up this procedure. I know it is possible to parallelize operations with h5py, but wanted to see if there was another approach before going down this road. I would be open to re-creating the database with different parameters / structure if it would make things faster.
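One single-process option, short of re-creating the file, might be to pay for the full traversal once and cache the result in a small side file. The sketch below is only an illustration under assumptions: the function names `collect_lengths` and `load_or_build_lengths` and the JSON cache path are hypothetical, the layout is assumed to be exactly two levels (group, then dataset), and the cache is never checked for staleness. It does not make the first scan any faster; it only avoids repeating it.

```python
import json
import os


def collect_lengths(root):
    """Map each dataset path to its length for a two-level group/dataset layout.

    `root` is expected to behave like an open h5py.File: root.items() yields
    (group_name, group), and group.items() yields (dataset_name, dataset),
    where dataset.shape[0] is the dataset's length.
    """
    lens = {}
    for group_name, group in root.items():
        for dataset_name, dataset in group.items():
            # Only the shape metadata is consulted; the array data is not read.
            lens[f"/{group_name}/{dataset_name}"] = int(dataset.shape[0])
    return lens


def load_or_build_lengths(h5_path, cache_path):
    """Return the name -> length dict, scanning the HDF5 file only if no cache exists."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)

    import h5py  # imported lazily: only needed for the one-off scan

    with h5py.File(h5_path, "r") as f:
        lens = collect_lengths(f)
    with open(cache_path, "w") as fh:
        json.dump(lens, fh)
    return lens
```

A call would look like `lens = load_or_build_lengths('/my/path/to/database.hdf5', 'lens.json')`, where `lens.json` is a made-up cache location. If re-creating the database is acceptable, the same dictionary could instead be built while the datasets are written, which would avoid the slow scan entirely.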

