Iterating through an HDF5 database is slow in Python

I created an HDF5 file with h5py containing 5000 groups, each holding 1000 datasets. The datasets are 1-D integer arrays averaging about 10,000 elements, though the length varies from dataset to dataset. The datasets were created without the memory storage option, and the total size of the database is 281 GB. I want to iterate over all datasets in order to build a dictionary mapping dataset name to dataset length, which I will use later in my algorithm.
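
For context, the file was built with code roughly like the following (a simplified sketch; the group/dataset names, the fixed size, and the random data are placeholders, not the real content):

import h5py
import numpy as np

# Rough sketch of the layout described above: 5000 groups x 1000 datasets,
# each dataset a 1-D integer array (~10,000 elements on average in the real
# file; a fixed size is used here only to keep the sketch short).
with h5py.File('/my/path/to/database.hdf5', 'w') as f:
    for g in range(5000):
        group = f.create_group(f'group_{g}')
        for d in range(1000):
            data = np.random.randint(0, 100, size=10000)
            group.create_dataset(f'dataset_{d}', data=data)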

Here's what I've tried.

import h5py

lens = {}
with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    for group in f.values():            # loop over the 5000 groups
        for dataset in group.values():  # loop over the 1000 datasets in each group
            lens[dataset.name] = len(dataset)

This is too slow for my purposes, and I'm looking for ways to speed it up. I know it's possible to parallelize operations with h5py, but I wanted to see whether there's another approach before going down that road. I would also be open to re-creating the database with different parameters or a different structure if that would make things faster; one idea along those lines is sketched below.
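
For example, I've been wondering about writing the name-to-length mapping into the file once, so that later runs only read two small datasets instead of opening millions of objects. A rough sketch of that idea (the 'meta/names' and 'meta/lengths' paths are made up for illustration, not part of the current file):

import h5py

# One-time pass: collect every dataset's name and length, then store the
# mapping inside the file itself under a hypothetical 'meta' group.
with h5py.File('/my/path/to/database.hdf5', 'r+') as f:
    names, lengths = [], []
    for group in f.values():
        for dataset in group.values():
            names.append(dataset.name)       # full path, e.g. '/group_0/dataset_0'
            lengths.append(dataset.shape[0])  # reads metadata only, not the data
    f.create_dataset('meta/names', data=names, dtype=h5py.string_dtype())
    f.create_dataset('meta/lengths', data=lengths)
    # On later runs, only these two small datasets would need to be read back.

I don't know whether this (or some other restructuring) is the right way to go, which is why I'm asking.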
