Does any Java library provide perfect hashing?

I did some searching and found some helpful posts about support for perfect (i.e., collision-free) hashing in Java.

Why doesn't Java hashCode support universal hashing?

Is it possible in Java to do something like Comparator but to implement custom equals() and hashCode()

But I'm looking for a practical solution, ideally in the form of a tested library. I have a situation that suits perfect hashing: in essence, the set of keys can be assumed fixed, and the program runs for a long time and does a lot of lookups. (This is not strictly true, but keys are added rarely enough that it's a close approximation; if I have to periodically rebuild the table or do something similar, that's fine.)

Basically, I would like to be able to increase the load factor while also reducing collisions. In other words, the goal is to reduce memory usage and increase throughput (i.e., lookups per second).
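For context, the only knob the standard library gives me here is HashMap's load factor; a minimal sketch of that baseline (the class name, capacity, and load-factor values are illustrative, not recommendations):

```java
import java.util.HashMap;
import java.util.Map;

// Baseline: HashMap's load factor trades memory against collisions.
// Perfect hashing would, in principle, let the table run dense
// and collision-free at the same time.
public class LoadFactorDemo {
    public static void main(String[] args) {
        Map<String, Integer> sparse = new HashMap<>(1 << 16, 0.50f); // fewer collisions, more memory
        Map<String, Integer> dense  = new HashMap<>(1 << 16, 0.95f); // denser table, more collisions
        sparse.put("key", 1);
        dense.put("key", 1);
    }
}
```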

There are some problems. Obviously, if hashCode() does not return distinct values for distinct keys, then perfect hashing is impossible. There are other considerations besides the hashing algorithm itself, such as the cost of computing hashCode() (should I cache hash codes for key objects, etc.?), or of whatever function I use to initially map my objects to ints or longs.
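On the caching point, a minimal sketch of what I mean by caching hash codes (the CachedKey wrapper is hypothetical, not an existing class):

```java
// Hypothetical wrapper (name is mine) that computes the hash once per key,
// so an expensive hashCode() is paid at insertion time, not on every lookup.
public final class CachedKey<K> {
    private final K key;
    private final int hash; // cached at construction

    public CachedKey(K key) {
        this.key = key;
        this.hash = key.hashCode();
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedKey && key.equals(((CachedKey<?>) o).key);
    }
}
```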

What I'm imagining is the ability to re-hash on a background thread, trying different hash functions until a perfect one, or at least a good one, is found. But I am open to other solutions, and I would prefer to use tested code rather than write it myself, although I am open to that too.
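To make the idea concrete, here is a rough sketch of the kind of background search I have in mind; the mix() scramble, table size, and retry budget are arbitrary placeholders, not a vetted perfect-hashing scheme:

```java
import java.util.HashSet;
import java.util.OptionalInt;
import java.util.Random;
import java.util.Set;

// Sketch: keep trying seeded hash functions until one maps the (fixed)
// key set into the table without collisions.
public class SeedSearch {
    static int mix(int h, int seed) {
        h ^= seed;
        h ^= h >>> 16;
        h *= 0x45d9f3b; // arbitrary odd multiplier, not a published scheme
        h ^= h >>> 16;
        return h;
    }

    /** Looks for a seed whose mixed hashes are all distinct mod tableSize. */
    static OptionalInt findPerfectSeed(int[] keyHashes, int tableSize, int maxTries) {
        Random rnd = new Random();
        for (int attempt = 0; attempt < maxTries; attempt++) {
            int seed = rnd.nextInt();
            Set<Integer> slots = new HashSet<>();
            boolean perfect = true;
            for (int h : keyHashes) {
                if (!slots.add(Math.floorMod(mix(h, seed), tableSize))) {
                    perfect = false; // collision: give up on this seed
                    break;
                }
            }
            if (perfect) return OptionalInt.of(seed);
        }
        return OptionalInt.empty(); // no collision-free seed within the budget
    }
}
```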

+3




3 answers


You don't need perfect hashing if your data is random enough. Mitzenmacher has a neat paper explaining why perfect hashing is difficult in practice, and why it is (usually) overkill anyway. I'll give you the link and paste the title below so you can find the paper if the link disappears.

http://people.seas.harvard.edu/~salil/research/streamhash-Jun10.pdf

Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream



Michael Mitzenmacher and Salil Vadhan, School of Engineering and Applied Sciences, Harvard University

June 23, 2010

Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing.

In practice, however, it is commonly observed that simple hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a "block source," whereby each new data item has some "entropy" given the previous ones. As long as the (Rényi) entropy per data item is sufficiently large, it turns out that the performance when choosing a hash function from a 2-universal family is essentially the same as for a truly random hash function. We describe results for several sample applications, including linear probing, balanced allocations, and Bloom filters.
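If you want to see what a 2-universal family actually looks like, here is a toy sketch of the classic ((a*x + b) mod p) mod m construction; the class name and parameter choices are mine, and keys are assumed to lie in [0, p):

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy illustration of a 2-universal family of the kind the abstract refers
// to: h(x) = ((a*x + b) mod p) mod m. Not from the paper itself.
public class UniversalHash {
    private static final long P = 2_147_483_647L; // Mersenne prime 2^31 - 1

    private final long a; // drawn uniformly from [1, p)
    private final long b; // drawn uniformly from [0, p)
    private final int m;  // number of buckets

    public UniversalHash(int buckets) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        this.a = rnd.nextLong(1, P);
        this.b = rnd.nextLong(0, P);
        this.m = buckets;
    }

    public int hash(long x) {
        // a < 2^31 and x < 2^31, so a * x fits in a signed 64-bit long.
        return (int) (((a * x + b) % P) % m);
    }
}
```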
+2




I don't understand why you want to re-hash on a background thread. What guarantees that the new hash table will have fewer collisions? Perhaps you mean to watch for collisions and re-hash with a different hash function when one occurs. But what if some of the new hash codes still collide? Do you re-hash until the number of collisions is zero? Nothing guarantees you will ever reach zero collisions. See the birthday problem for why: http://en.wikipedia.org/wiki/Birthday_problem .
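To put numbers on the birthday bound, a quick back-of-the-envelope calculation (the N values and the 2^32 hash space are just illustrative):

```java
// Birthday bound: with N keys hashed uniformly into 2^32 values,
// P(at least one collision) ~= 1 - exp(-N(N-1) / 2^33).
public class BirthdayBound {
    public static void main(String[] args) {
        double space = Math.pow(2, 32);
        for (long n : new long[] {1_000, 10_000, 100_000, 1_000_000}) {
            double p = 1 - Math.exp(-(double) n * (n - 1) / (2 * space));
            System.out.printf("N = %,d -> collision probability ~ %.4f%n", n, p);
        }
    }
}
```

Even at 100,000 keys, the chance of at least one 32-bit collision is already well over one half.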

I think what you need is a good hash function with strong collision resistance. I'll share my research with you. Hope this helps!



The hash function with the best collision resistance I have found is CRC32. With this function, the probability that a given key collides with any of the other N - 1 keys is about (N - 1) / 2^32. The second post there will surprise you, and there is another study that reinforces this. There is a built-in class for it: java.util.zip.CRC32.
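For reference, a minimal example of the built-in class (the demo class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal use of the built-in java.util.zip.CRC32 checksum class.
public class Crc32Demo {
    public static void main(String[] args) {
        CRC32 crc = new CRC32();
        crc.update("hello world".getBytes(StandardCharsets.UTF_8));
        System.out.printf("CRC32 = %08x%n", crc.getValue()); // 32-bit value, returned as a long
    }
}
```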

0




Use a crypto library like BouncyCastle to get stronger hash functions. See Hash String via SHA-256 in Java .
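For example, SHA-256 is already available through the standard MessageDigest API, and BouncyCastle registers as another provider behind that same API; a minimal sketch (the demo class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// SHA-256 via the standard MessageDigest API; BouncyCastle plugs into the
// same API as an extra JCA provider, so the call pattern is identical.
public class Sha256Demo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest("hello world".getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // 64 hex characters = 256 bits
    }
}
```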

Another option would be something like http://www.anarres.org/projects/jperf/ , but I haven't tried it myself.

-2








