Hash function combination - is there a significant reduction in collision risk?

Does anyone know if there is a real benefit regarding reducing the chance of collisions by combining hash functions? I especially need to know this in relation to 32-bit hashing, namely for combining Adler32 and CRC32. Basically, does adler32 (crc32 (data)) give less chance of collision than crc32 (data)? Last comment heregives some test results in favor of combining, but the source is not mentioned. For my purpose, the collision is not critical (i.e. the task is not security related), but I would rather minimize the likelihood if possible. PS: I'm just getting started in the wonderful world of hashing, I read a lot about it. Sorry if I asked a stupid question, I haven't acquired the proper "hashing dialect" yet, maybe my google searches were also poorly formed. Thank.

+2


source to share


1 answer


It doesn't make sense to combine them consistently. You are putting one 32-bit space into another 32-bit space.

In the case of a crc32 collision in the first step, the end result is still a collision. Then you add any potential collisions in the adler32 stage. So it cannot get better, and it can only be the same or worse.

To reduce collisions, you can try something like using two hashes independently of each other to create a 64-bit output space:

adler32 (data) <32 | CRC32 (data)



Whether there is a significant benefit to this, I'm not sure.

Note that the original comment you talked about stores hashes independently:

Whichever algorithm you use there will be some chance of false positives. However, you can reduce these odds by a big difference by using two different hashing algorithms. If you were to calculate and store both CRC32 and Alder32 for each URL, the chances of a collision for both hashes for any given pair of URLs is reduced.

Of course, this means saving twice a lot of information, which is part of your original problem. However, there is a way to store both sets of data hashes so that it requires minimal memory (10kB or so), giving almost the same lookup performance (15 microsecs / lookup versus 5 microsecs) as Perl hashes.

+6


source







All Articles