Corrupt files when creating HDF5 files without closing them (h5py)

I am using h5py to store experiment data in an HDF5 container.

In an interactive session, I open the file using:

measurement_data = h5py.File('example.hdf5', 'a')

      

Then I write data to the file using some self-written functions (possibly many GB of data from an experiment lasting several days). At the end of the experiment, I close the file using

measurement_data.close()

      

Unfortunately, from time to time the interactive session ends without me explicitly closing the file (accidentally killing the session, a power outage, an OS crash caused by other software). This always results in a corrupt file and the loss of all the data. When I try to open it, I get an error:

OSError: Unable to open file (File signature not found)

      

I also cannot open the file in HDFview or any other software I have tried.

  • Is there a way to avoid a corrupt file even if it is not explicitly closed? I read about using the instructions here, but I'm not sure whether that would help when the session ends unexpectedly.
  • Is there any way to recover data in damaged files? Is a repair program available?

Opening and closing the file for every write access sounds rather unfavorable to me, because I am constantly writing data from many different functions and threads. So I would be happier with another solution.

+3




2 answers


The corruption problem is known to the HDF5 designers. They are working on fixing it in version 1.10 by adding journaling. In the meantime, you can call flush() periodically to make sure your writes have been flushed, which should limit the damage. You can also use external links, which let you keep pieces of data in separate files but link them together into a single structure when you read them.
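A minimal sketch of both suggestions, assuming an appendable dataset named `samples` and one HDF5 file per chunk of the experiment. The helper names (`append_rows`, `record`, `link_parts`), the dataset layout, and the flush interval are illustrative assumptions, not part of the original answer. Note that `flush()` only narrows the window in which data can be lost; it is not a guarantee that the file stays readable after a crash.

```python
import os

import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow a resizable dataset along axis 0 and write `rows` at the end."""
    start = dset.shape[0]
    dset.resize(start + len(rows), axis=0)
    dset[start:] = rows


def record(path, batches, flush_every=10):
    """Append each batch to `path`, flushing every `flush_every` batches."""
    with h5py.File(path, "a") as f:
        if "samples" in f:
            dset = f["samples"]
        else:
            # Hypothetical layout: rows of 3 float64 values, unlimited length.
            dset = f.create_dataset(
                "samples", shape=(0, 3), maxshape=(None, 3),
                dtype="f8", chunks=(1024, 3),
            )
        for i, batch in enumerate(batches, start=1):
            append_rows(dset, np.asarray(batch, dtype="f8"))
            if i % flush_every == 0:
                f.flush()  # ask HDF5 to write its buffers out to the file


def link_parts(master_path, part_paths):
    """Create one external link per part file inside a small master file."""
    with h5py.File(master_path, "a") as master:
        for k, part in enumerate(part_paths):
            name = f"part_{k}"
            if name not in master:
                # Only the file name is stored, so this assumes the part
                # files sit in the same directory as the master file.
                master[name] = h5py.ExternalLink(
                    os.path.basename(part), "/samples"
                )
```

With this layout, a crash while writing one part file should leave the already closed part files untouched, and the master file only has to be opened briefly to add a link.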



+4




Nothing will prevent the file from being damaged in the event of, for example, a power outage. All you can do is minimize the damage. One way to do this is redundancy: use two files instead of one, with only one of them open at any time. Say file 1 is open and you write all your changes to it. After a certain amount of time or a certain amount of data written, close it, update file 2 from file 1, and continue writing to file 2, and so on, alternating.
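The alternating scheme could be sketched roughly as follows. This is one possible reading of the idea, not a tested recipe: the function name `write_alternating`, the one-dataset-per-batch layout, and the use of a plain file copy to "update" the idle file are all assumptions.

```python
import shutil

import h5py
import numpy as np


def write_alternating(paths, batches, switch_every=100):
    """Write batches to one of two files at a time, swapping periodically.

    `paths` is a pair of file names. Returns the path that was active last.
    """
    active = 0
    f = h5py.File(paths[active], "a")
    try:
        for i, batch in enumerate(batches, start=1):
            # Hypothetical layout: one dataset per batch.
            f.create_dataset(f"batch_{i:06d}", data=np.asarray(batch))
            if i % switch_every == 0:
                f.close()  # the file just written is now closed cleanly
                other = 1 - active
                # Bring the idle file up to date from the closed one.
                shutil.copyfile(paths[active], paths[other])
                active = other
                f = h5py.File(paths[active], "a")
    finally:
        f.close()
    return paths[active]
```

At any moment only one file is open for writing, so a crash should cost at most the data written since the last swap. The price is a full copy at every swap, which may be slow for files of many GB.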



+1








