Low collision hash required. Inputs are a permutation of the same 72 bits
For an associative array, I am hashing 72 bits to 26 bits and need an extremely low collision rate. Speed is also a major factor. I started at https://en.wikipedia.org/wiki/Hash_function#Hashing_By_Nonlinear_Table_Lookup It's fast enough, but in testing I got collisions in the first 100 entries!
for(=0; k<9; k++)
hash ^= randombits[byteptr[k]];
hash &= (1<<HASHBITS)-1
The problem is that all my inputs come from each other - there are initial 72 bits, and subsequent inputs are made from the previous one, by swapping two bit pairs (aligned on even bit boundaries) with each other. The simple XOR loop above is unfortunately inferior. An obvious approach would be to use the byte index "k" to somehow change the hash on each iteration. In addition to the code suggestions, I'm looking for a theoretical background - I want to have a good idea of what the probability of a collision will be. Reading the docs I don't find anything about hash efficiency on permutations. I'm sure crypto hashes will do well, but they look slow. Suggestions and pointers for my research are welcome, and thanks. FWIW I am using ARC4 to generate "randombits", an array of 256 u_int32_t.
EDIT for clarity: bit-transpositions in the input look like this: for even numbered N and X less than 72, the (N: N + 1) bit is replaced by the (X: X + 1) bits, for the random X and N. All such swaps are actually change state - identical bits are never swapped.
source to share
Here's a solution that resolved my problem: in the loop I placed, between each XOR, I rotate the hash by k bits, where k is the loop index. In my test collisions it is now just as unusual with permutations of the input data as with single bit flips. It seems like the "birthday paradox" is what helps me reduce the collision rate, but 26 bits is all I can afford - the hash table is already huge.
source to share
Appliyng the birthday paradox , you can see that statistically you will get a collision probability of over 0.5 after hashing (2 ** 13 = 8000).
Bast try is to use crypto hashing or some hash function based on a good and fast PRNG (Xorshift comes to my mind). You can try MurmurHash3 (the author claims it passes some statistical tests and is pretty fast)
https://code.google.com/p/smhasher/source/browse/branches/chandlerc_dev/MurmurHash3.cpp
source to share