Probabilistic File Checking - Algorithm or Libraries?

I am looking for an efficient means of partially verifying the integrity of "large" datasets over a slow transfer medium. This seems to be a common problem, as file sizes grow out of proportion to transfer rates.

For example, to put concrete numbers on it: a terabyte of data over USB2. Checking that this data is still valid by reading every byte into a hash or checksum takes a day and increases the risk of drive failure.

Instead, the code would need to check random chunks of the data and provide a probability of validity based on the time available. If allowed to run long enough, all blocks would be verified (the base case of reading the entire dataset).
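To make the idea concrete, here is a minimal sketch (Python) of a random-block spot check against precomputed per-block digests; the block size, time budget, and the `reference_digests` list are assumed purely for illustration:

```python
import hashlib
import random
import time

def spot_check(path, reference_digests, block_size=1 << 20, time_budget_s=60.0):
    """Verify randomly chosen blocks against precomputed SHA-256 digests
    until the time budget runs out.  Returns (blocks_checked, total, ok)."""
    total_blocks = len(reference_digests)
    order = list(range(total_blocks))
    random.shuffle(order)                      # random order, no block read twice
    checked = 0
    deadline = time.monotonic() + time_budget_s
    with open(path, "rb") as f:
        for block_index in order:
            if time.monotonic() > deadline:
                break
            f.seek(block_index * block_size)
            digest = hashlib.sha256(f.read(block_size)).digest()
            if digest != reference_digests[block_index]:
                return checked + 1, total_blocks, False   # corruption found
            checked += 1
    return checked, total_blocks, True

# If all sampled blocks match and corruption independently affects a fraction
# p of the blocks, the chance of having missed it is roughly (1 - p) ** checked.
```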

Using "history":
- Data is stored in large encrypted containers (sizes 1 TB .. 1 GB).
- Each container is stored redundantly across multiple sets of drives in different locations.
- Verification must be performed without knowledge of the underlying data or keys.

Failure modes the verification method needs to DETECT:
- Storage transport failures (for example, the controller drops part of the physical address space)
- Sector errors (no data returned for a particular block)
- Single-bit errors (non-ECC memory or caches)

If errors are detected, the data is recovered from redundant storage. The verification data should probably be stored separately.

Since the goal is data integrity, the techniques from file-sharing networks do not seem to apply: a "hash tree" requires full storage of the hashes at each node, which seems like more storage than necessary for scenarios without active attackers.

  • How do I determine the trade-off of storage space versus time to read the relevant blocks of the file? (A rough sketch of this trade-off follows the list.)
  • If a hash tree / hash list is the best way, how safe is it to store partial values of the hashes?
  • Would a checksum or error-correcting code be a better choice than hashes, for equivalent protection?
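For the first question, a back-of-the-envelope sketch (Python) of the storage-versus-read-size trade-off, using the 1 TB figure from this question and a 32-byte digest such as SHA-256 (the candidate block sizes are arbitrary):

```python
TERABYTE = 10 ** 12
DIGEST_BYTES = 32                      # e.g. one SHA-256 digest per block

for block_size in (4 << 10, 64 << 10, 1 << 20, 16 << 20):
    blocks = TERABYTE // block_size
    hash_storage_mib = blocks * DIGEST_BYTES / (1 << 20)
    print(f"block {block_size:>10} B -> {blocks:>10} digests, "
          f"{hash_storage_mib:10.1f} MiB of hash storage")

# Smaller blocks: more hash storage, but each spot check reads very little.
# Larger blocks: tiny hash storage, but each check pays for a large read.
```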
+1




3 answers


Transfers are done via USB2, right? Therefore, you should know that:

  • USB communication happens in packets, with payloads of up to 1024 bytes for high-speed transfers and a 16-bit CRC.
  • Each packet is acknowledged and can be retransmitted.


You should take this into account when designing an algorithm that adds guarantees beyond the CRCs already provided; otherwise it would be useless. If I remember correctly, a 16-bit CRC can detect any single error burst no longer than 16 bits, and some longer ones.

You can start with Wikipedia: http://en.wikipedia.org/wiki/USB2 and http://en.wikipedia.org/wiki/Cyclic_redundancy_check.
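To illustrate the burst-detection claim, here is a small sketch (Python) of a bit-by-bit CRC-16 built on the same generator polynomial as USB's data CRC (x^16 + x^15 + x^2 + 1, reflected form 0xA001); the initial value and final XOR conventions differ between CRC-16 variants, so treat this as illustrative rather than an exact reproduction of the USB computation:

```python
def crc16(data: bytes, poly: int = 0xA001, init: int = 0x0000) -> int:
    """Bit-by-bit CRC-16 (reflected form of polynomial 0x8005)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc

payload = bytes(range(256))
corrupted = bytearray(payload)
corrupted[100] ^= 0x01                              # flip a single bit
assert crc16(payload) != crc16(bytes(corrupted))    # single-bit error detected
```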

+2




You could try using something like PAR2 to generate redundancy data. This would let you both check and correct the data, and it could probably be adapted to work with random access.
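A hedged example of what that might look like with the par2cmdline tool (this assumes par2 is installed; the container file name and the -r redundancy percentage are illustrative, and option syntax can vary between versions):

```python
import subprocess

container = "container.bin"   # hypothetical encrypted container file

# Create ~5% redundancy data as sidecar .par2 files (keep them on separate storage).
subprocess.run(["par2", "create", "-r5", container + ".par2", container], check=True)

# Later: verify the container, and repair from the redundancy data if needed.
subprocess.run(["par2", "verify", container + ".par2"], check=True)
# subprocess.run(["par2", "repair", container + ".par2"], check=True)
```

Note that standard PAR2 verification reads the whole file, so the random-access use mentioned above would need additional work.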



+1




How about storing hash or checksum values for runs of data within the file? Then you only have to read a limited portion of the data for a limited validation of the file's contents.
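A minimal sketch of that approach (Python; the run size and the sidecar index file are arbitrary choices): build an index of one SHA-256 digest per fixed-size run once, then verify any single run with one short read of the data file.

```python
import hashlib

RUN_SIZE = 16 << 20                         # 16 MiB per run (arbitrary)
DIGEST_SIZE = 32                            # bytes per SHA-256 digest

def build_index(data_path, index_path):
    """Write one SHA-256 digest per fixed-size run of the data file."""
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        while run := data.read(RUN_SIZE):
            index.write(hashlib.sha256(run).digest())

def check_run(data_path, index_path, run_number):
    """Re-hash a single run and compare it to its stored digest."""
    with open(data_path, "rb") as data:
        data.seek(run_number * RUN_SIZE)
        actual = hashlib.sha256(data.read(RUN_SIZE)).digest()
    with open(index_path, "rb") as index:
        index.seek(run_number * DIGEST_SIZE)
        expected = index.read(DIGEST_SIZE)
    return actual == expected
```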

0








