Is the md5 function safe for merging datasets?

We're about to promote a piece of code that uses the SAS md5() hash function to efficiently track changes in a large dataset.

format md5 $hex32.;
md5=md5(cats(of _all_));

      

According to the documentation:

The MD5 function converts a string based on the MD5 algorithm to a 128-bit hash value. This hash value is called the message digest (digital signature), which is almost unique for each line that is passed to the function.

At about what stage, if any, does “almost unique” become a data integrity risk?

+3




2 answers


I've seen an example where the md5 comparison goes wrong. If one row has the values AB and CD in its two columns and another row has A and BCD, both rows get the same md5 value. See this example:

data md5;
  attrib a b length=$3 informat=$3.;
  infile datalines;
  input a b;
  format md5 $hex32.;
  md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;
run;

      



This is, of course, because CATS(of _all_) strips leading and trailing blanks from each variable (numbers are converted to strings using the BEST format) and then concatenates the results without a delimiter. If you use CAT instead, this will not happen because leading and trailing spaces are not removed. This kind of error is not far-fetched. If you have missing values, it can happen more often. If, for example, you have many binary values in text variables, some of which are missing, it can happen very often.

You can also avoid it by manually adding a separator between the values. Of course, you still have a problem when the separator itself appears in the data: with "!" as the separator, ("AB!" and "CD") and ("AB" and "!CD") produce the same string.
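As a rough sketch of the separator idea, the step below hashes the same two rows with and without a delimiter. The dataset name md5_delim, the variable names md5_plain and md5_delim, and the choice of '|' as the separator are assumptions made for illustration only. The variables are listed explicitly instead of using of _all_, so the delimiter can be placed between them and the hash variables themselves stay out of the input.

```sas
data md5_delim;
  attrib a b length=$3 informat=$3.;
  infile datalines;
  input a b;
  length md5_plain md5_delim $16;      /* MD5 returns 16 bytes */
  format md5_plain md5_delim $hex32.;
  /* No delimiter: both rows hash the string 'ABCD' */
  md5_plain = md5(cats(a, b));
  /* Explicit delimiter: row 1 hashes 'AB|CD', row 2 hashes 'A|BCD' */
  md5_delim = md5(cats(a, '|', b));
datalines;
AB CD
A BCD
;
run;

proc print data=md5_delim;
run;
```

With this input, md5_plain should be identical on both rows while md5_delim differs. The delimiter is written as a literal argument to CATS rather than passed to CATX because, as far as I know, CATX skips blank arguments together with their delimiter, which would bring the ambiguity back for missing values. The caveat about the separator appearing in the data still applies.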

+3




MD5 has 2^128 possible values, and from what I've read, once you have around 2^64 different values (roughly 1.8 × 10^19) you start to have a high probability of a collision.
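To put that threshold in perspective, here is a small sketch of the usual birthday approximation, p ≈ n^2 / (2 · 2^128), which holds while p is small and assumes MD5 output behaves like a uniformly random 128-bit value. The row count of one billion is an assumed figure for illustration, not something taken from the question.

```sas
data _null_;
  n = 1e9;                  /* assumed number of distinct rows being hashed */
  p = (n**2) / (2**129);    /* birthday approximation: n^2 / (2 * 2^128) */
  put 'Approximate collision probability: ' p e12.;
run;
```

For a billion rows this comes out on the order of 10^-21, so purely accidental collisions are far less of a concern than the way the input string is built.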



However, because of the way MD5 is constructed, there is some risk of collisions between very similar preimages that differ by only two bytes. As such, it is difficult to say how risky it is for your particular process. Of course, a collision still requires two messages that happen to hash to the same value, which is unlikely. Does the hash save you enough computational time to outweigh the small risk?

+2

