Best way to GetHashCode() for a 44-bit number stored as an Int64

I have about 5 million objects stored in a Dictionary<MyKey, MyValue>.

MyKey is a structure that packs each component of my key (5 different numbers) into the rightmost 44 bits of an Int64 (ulong).
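
For illustration only (the question does not give the actual layout, so the field widths below - 12 + 10 + 10 + 8 + 4 = 44 - are invented), a key packed this way might look roughly like:

using System;

public readonly struct MyKey : IEquatable<MyKey>
{
    // Only the low 44 bits are used; the top 20 bits are always zero.
    private readonly ulong _packed;

    // Hypothetical component widths: 12 + 10 + 10 + 8 + 4 = 44 bits.
    public MyKey(uint a, uint b, uint c, uint d, uint e)
    {
        _packed = ((ulong)(a & 0xFFFu) << 32)
                | ((ulong)(b & 0x3FFu) << 22)
                | ((ulong)(c & 0x3FFu) << 12)
                | ((ulong)(d & 0xFFu) << 4)
                |  (ulong)(e & 0xFu);
    }

    public bool Equals(MyKey other) => _packed == other._packed;
    public override bool Equals(object obj) => obj is MyKey other && Equals(other);

    // Currently just delegates to the built-in 64-bit hash; the question is
    // whether something better exists for values that only use 44 bits.
    public override int GetHashCode() => _packed.GetHashCode();
}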

Since the ulong will always begin with 20 zero bits, my gut feeling is that the native implementation of Int64.GetHashCode() is likely to collide more often than a hash code implementation that only considers the 44 bits actually in use (although mathematically I wouldn't know where to begin proving that theory).

This increases the number of calls to .Equals() and makes dictionary lookups slower.

The .NET implementation of Int64.GetHashCode() looks like this:

public override int GetHashCode()
{
    return (int)this ^ (int)(this >> 32);
}
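
For a value whose top 20 bits are always zero, the second operand of that XOR can never have more than its low 12 bits set. A made-up example to show the arithmetic:

// Made-up example: key = 0x0ABCDEF01234 (44 significant bits)
//   (int)key          = 0xDEF01234   // the full lower 32 bits
//   (int)(key >> 32)  = 0x00000ABC   // only the low 12 bits can ever be non-zero
//   hash = 0xDEF01234 ^ 0x00000ABC   = 0xDEF01888
// So the XOR only perturbs 12 of the 32 hash bits; the remaining 20 bits come
// straight from the low word of the key.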

      

What is the best way to implement GetHashCode()?

+3




2 answers


Here is my attempt at answering my own question, which I am posting even though the answer is the opposite of what I was expecting. (Although I may have made a mistake somewhere - I almost hope so, and I am open to criticism of my testing methodology.)

  // Number of Dictionary hash buckets found here:
  // http://stackoverflow.com/questions/24366444/how-many-hash-buckets-does-a-net-dictionary-use
  const int CNumberHashBuckets = 4999559;

  static void Main(string[] args)
  {
     Random randomNumberGenerator = new Random();

     int[] dictionaryBuckets1 = new int[CNumberHashBuckets];
     int[] dictionaryBuckets2 = new int[CNumberHashBuckets];

     for (int i = 0; i < 5000000; i++)
     {
        ulong randomKey = (ulong)(randomNumberGenerator.NextDouble() * 0x0FFFFFFFFFFF);

        // Hash 1: the built-in 64-bit hash (low 32 bits XOR high 32 bits).
        int simpleHash = randomKey.GetHashCode();
        BumpHashBucket(dictionaryBuckets1, simpleHash);

        // Hash 2: the alternative suggested in a comment - XOR the top 32 of
        // the 44 significant bits (key >> 12) with the low 32 bits.
        int superHash = ((int)(randomKey >> 12)).GetHashCode() ^ ((int)randomKey).GetHashCode();
        BumpHashBucket(dictionaryBuckets2, superHash);
     }

      int collisions1 = ComputeCollisions(dictionaryBuckets1);
      int collisions2 = ComputeCollisions(dictionaryBuckets2);

      Console.WriteLine("Built-in hash collisions: {0}, alternative hash collisions: {1}",
         collisions1, collisions2);
  }

  private static void BumpHashBucket(int[] dictionaryBuckets, int hashedKey)
  {
     int bucketIndex = (int)((uint)hashedKey % CNumberHashBuckets);
     dictionaryBuckets[bucketIndex]++;
  }

  private static int ComputeCollisions(int[] dictionaryBuckets)
  {
     int i = 0;
     foreach (int dictionaryBucket in dictionaryBuckets)
        i += Math.Max(dictionaryBucket - 1, 0);
     return i;
  }

      

I am trying to simulate the processing that the Dictionary would carry out. The OP says he holds "about 5,000,000" objects in the Dictionary, and according to the cited source there will be either 4999559 or 5999471 "buckets" in the Dictionary.



I then generate 5,000,000 random 44-bit keys to simulate the OP's Dictionary entries, and hash each key two different ways: the simple ulong.GetHashCode() and the alternative way I suggested in a comment. I then turn each hash into a bucket index using modulo - I assume that is how the Dictionary does it. The simulated buckets are used to count the number of collisions.

Unfortunately (for me) the results are not what I had hoped. With 4999559 buckets the simulation typically shows about 1.8 million collisions, with my "super-hash" technique actually having slightly (about 0.01%) MORE collisions. With 5999471 buckets there are typically about 1.6 million collisions, and my so-called super-hash gives maybe 0.1% fewer collisions.

So my gut feeling was wrong, and there seems to be no justification for seeking a better hash code technique.

-2




I can't begin to suggest a "better" way to hash 44-bit numbers, but I can suggest a way to compare any candidate against the 64-bit hash algorithm.

One way to do this is simply to check how many collisions you get for a set of numbers (as suggested by McKenzie et al. in Selecting a Hashing Algorithm). Unless you are going to test every possible value in your set, you will need to judge whether the number of collisions you get is acceptable. This could be done in code with something like:

var rand = new Random(42);
var dict64 = new Dictionary<int, int>();
var dict44 = new Dictionary<int, int>();
for (int i = 0; i < 100000; ++i)
{
    // get value between 0 and 0xfffffffffff (max 44-bit value)
    var value44 = (ulong)(rand.NextDouble() * 0x0FFFFFFFFFFF);
    var value64 = (ulong)(rand.NextDouble() * ulong.MaxValue);
    var hash64 = value64.GetHashCode();
    var hash44 = (int)value44 ^ (int)(value44 >> 32);  // same mixing as Int64.GetHashCode()
    // Track unique hash values (ContainsKey rather than ContainsValue,
    // since key and value are identical here and key lookup is O(1)).
    if (!dict64.ContainsKey(hash64))
    {
        dict64.Add(hash64, hash64);
    }
    if (!dict44.ContainsKey(hash44))
    {
        dict44.Add(hash44, hash44);
    }
}
Trace.WriteLine(string.Format("64-bit hash: {0}, 64-bit hash with 44-bit numbers {1}", dict64.Count, dict44.Count));

      

In other words, generate 100,000 random 64-bit values and 100,000 random 44-bit values, hash each one, and keep track of the unique hash values.



My test run produced 99998 unique hash values for the 44-bit numbers and 99997 for the 64-bit numbers - slightly fewer collisions for 44-bit numbers than for 64-bit numbers. I would expect fewer collisions with 44-bit numbers simply because there are fewer possible inputs.

I won't claim that the 64-bit hash method is "best" for 44-bit values; you will need to decide whether these results are good enough for your circumstances.

Ideally you should test with realistic values that your application is likely to generate. Given that those will all be 44-bit values, it is hard to compare their collisions against ulong.GetHashCode() (i.e. you would get the same results). If random values based on a constant seed are not good enough, change the code to something better.
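
For example (purely a sketch - PackKey and GetRealisticComponents below are hypothetical stand-ins for however the application actually packs and produces its five key components), the 44-bit line in the loop above could be replaced with realistically generated keys:

// Hypothetical: GetRealisticComponents yields realistic five-component tuples,
// and PackKey packs them into the low 44 bits the same way the real key does.
foreach (var (a, b, c, d, e) in GetRealisticComponents(100000))
{
    ulong value44 = PackKey(a, b, c, d, e);             // realistic 44-bit key
    int hash44 = (int)value44 ^ (int)(value44 >> 32);   // same hash as in the loop above
    if (!dict44.ContainsKey(hash44))
    {
        dict44.Add(hash44, hash44);
    }
}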

While things may not "feel" right, the science suggests not changing anything without reproducible tests proving that a change is necessary.

+4

