Computing a hash of a data structure?

Let's say I want to calculate the hash of a data structure using a hash algorithm like MD5 that accepts a sequential stream to check for equivalence. (I want to write a hash and then recalculate the hash on the same or equivalent data structure later and check the hashes to evaluate equivalence with high probability.)

Are there standard methods for this?

The problems I see are problematic

  • If the data structure contains an array of binary strings, I cannot just concatenate them because ["abc", "defg"] and ["ab", "cdefg"] are not equivalent arrays
  • if the data structure contains a collection that is not guaranteed to be enumerated in the same order, e.g. key-value dictionary {a: "bc", d: "efg", h: "ijkl"}, which should be considered equivalent to the key-value pair {d: "efg", h: "ijkl", a: "bc" }.
+2


source to share


4 answers


In the first release, string lengths are also hashed. This will differentiate their hashes.



For the second, sort the keys.

+3


source


The "standard" way to do this is to define a serialized form of the data structure and digest the resulting byte stream.



For example, TBSCertificate is a data structure containing the subject name, extensions, and other information. This is converted to an octet string in a deterministic manner and hashed as part of the digital signature operation to generate the certificate.

+2


source


There is another problem with structs and that is the alignment of data members across platforms. If you want a stable and portable solution, you can solve this by implementing a "serialize" method on your data structure in such a way that serialize will generate a byte stream (or more often output to a byte stream). Then you can use a serialized stream hash algorithm. This way, you will be able to solve the problems you mentioned by explicitly looking up your data. As other additional features, you can save your data to hdd or send it over the network.

For strings, you can implement a Pascal storage type where the length is the first.

+1


source


  • If strings cannot have any nul characters, you can use C strings to ensure uniqueness, eg. "abc \ 0defg \ 0" is different from "cdefg \ 0".
  • For dictionaries, perhaps you can sort before hashing.

It also reminds me of a problem I once heard about ... I don't know what language you are using, but if you also hash C structures without any filtering, be careful about the space between fields that the compiler could enter for alignment. Sometimes they are not cleared.

+1


source







All Articles