Iterating through an HDF5 database is slow in Python

I created an HDF5 file with h5py that contains 5000 groups, each holding 1000 datasets. The datasets are 1-D arrays of integers, averaging about 10,000 elements, although this varies from dataset to dataset. The datasets were created without the chunked storage option. The total size of the file is 281 GB. I want to iterate over all of the datasets to build a dictionary mapping each dataset's name to its length; I will use this dictionary later in my algorithm.

Here's what I've tried.

import h5py

f = h5py.File('/my/path/to/database.hdf5', 'r')
lens = {}
for group in f.values():                      # iterate over the 5000 top-level groups
    for dataset in group.values():            # iterate over the datasets in each group
        lens[dataset.name] = dataset.shape[0] # length of the 1-D integer array
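
The same traversal can also be written as a single pass with visititems, which walks the whole hierarchy and only needs object metadata. This is just a sketch using the same placeholder path; it still has to open all five million datasets, so I am not sure it would be meaningfully faster.

import h5py

lens = {}

def collect_length(name, obj):
    # record the first-axis length of every dataset encountered
    if isinstance(obj, h5py.Dataset):
        lens[obj.name] = obj.shape[0]

with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    f.visititems(collect_length)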

      

This is too slow for my purposes, and I am looking for ways to speed it up. I know it is possible to parallelize operations with h5py, but I wanted to see whether there is another approach before going down that road. I would also be open to re-creating the database with different parameters / structure if that would make things faster.
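
One idea if I rebuild the file (a sketch only; the index group name, dataset layout, and the write_index helper below are made up for illustration): record every dataset's length in a small index while the file is being written, so that rebuilding the name-to-length dictionary later takes two reads instead of opening every dataset.

import h5py
import numpy as np

def write_index(f, names, lengths):
    # called once at creation time, after collecting names/lengths while the
    # datasets are written (e.g. names.append(dset.name); lengths.append(len(arr)))
    f.create_dataset('index/names', data=names, dtype=h5py.string_dtype())
    f.create_dataset('index/lengths', data=np.asarray(lengths, dtype=np.int64))

# later: rebuild the dictionary from the index with two reads
with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    names = f['index/names'].asstr()[...]   # asstr() needs h5py 3.x
    lengths = f['index/lengths'][...]
lens = dict(zip(names, lengths.tolist()))

Storing the length as an attribute on each dataset would also work, but reading attributes still means opening every object, so a single index dataset seems preferable.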

python hdf5







