Generating random functions (as opposed to random numbers)

I would like to create a function that takes a string and returns a number between 0 and 1. The function should consistently return the same number when given the same string, but other than that there should be no discernible pattern in the results. The output numbers for any large set of input strings should follow a uniform distribution.

Also, I need to generate more than one such function, i.e. when given the string "abc", function A might consistently return 0.593927, while function B consistently returns 0.0162524. I need it to be fast (this is for a numerical simulation) and to have reasonably good statistics.

I am using Python and would be happy with answers of the form "here is an easy way to do it using a Python library" or "here is an algorithm you can implement". If there is no fast way to do this in Python, I'll just switch to C.

I realize that either of the following two methods would work, but each has drawbacks that make me want to look for a more elegant solution.

  • Store a dictionary
    I could just compute a new random number every time I am given a new string, and store it in a dictionary to be looked up if I get the same string again. However, my application is likely to generate many strings that appear only once, so this will eventually require storing a very large dictionary in memory. It also makes repeatability difficult: even with the same seed, I will end up with a different function if I receive the same strings in a different order. For these reasons, it would be much better to compute the random numbers consistently on the fly.

  • Use a hash function
    I could just call a hash function on the string and then convert the result to a number. The problem of creating multiple functions can be solved, for example, by prepending a seed string to each input string. However, I am then stuck trying to find a hash function with suitable speed and statistics. Python's built-in hash is fast, but it is implementation-dependent, and I don't know how good its statistics are, since it was not designed for this kind of purpose. On the other hand, I could use a secure hashing algorithm like md5, which would have good statistics, but that would be too slow for my application. Hash functions designed for storage applications are usually much faster than cryptographically secure ones such as md5, but they are designed to avoid collisions, not to produce uniformly distributed output, and the two are not necessarily the same thing.
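A minimal sketch of the seeded-hash idea, assuming Python 3.6+ and using the keyed mode of `hashlib.blake2b` in place of prepending a seed string. The names `make_hash_function`, `func_a` and `func_b` are hypothetical, and the sketch does not address whether this is fast enough for the simulation:

```python
import hashlib

def make_hash_function(seed):
    """Return a function mapping str -> float in [0, 1), fixed by `seed`."""
    key = seed.encode("utf-8")  # blake2b accepts keys of up to 64 bytes

    def f(s):
        digest = hashlib.blake2b(
            s.encode("utf-8"), digest_size=8, key=key
        ).digest()
        # Keep the top 53 bits of the 64-bit digest so the quotient
        # is exactly representable as a float and strictly below 1.
        return (int.from_bytes(digest, "big") >> 11) / 2**53

    return f

func_a = make_hash_function("A")
func_b = make_hash_function("B")
print(func_a("abc"), func_b("abc"))  # two unrelated values
print(func_a("abc"))                 # same as the first value above
```

Nothing is stored between calls, so the result does not depend on the order in which strings arrive.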

Additional note on hash functions

To illustrate that avoiding collisions and getting uniform results are two different things, consider the following example using Python's built-in hash function:

>>> hash("aaa") % 1000
340
>>> hash("aab") % 1000
343
>>> hash("aac") % 1000
342
>>> hash("aad") % 1000
337
>>> hash("aae") % 1000
336
>>> hash("aaf") % 1000
339
>>> hash("aag") % 1000
338
>>> hash("aah") % 1000
349
>>> hash("aai") % 1000
348
>>> hash("aaj") % 1000
351
>>> hash("aak") % 1000
350

      

There are no collisions in the above output, but the values are also clearly not uniformly distributed: they all lie between 336 and 351, and there is a clear pattern in the third digit as well. I realize I might get better statistics by doing (hash("aaa")/HASH_MAX)*1000 (assuming I can figure out what HASH_MAX should be), but this should help illustrate that the requirements for a good hash function are not the same as the requirements for the function I'm looking for.
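For reference, a sketch of scaling by the full hash range instead of taking a remainder. It assumes Python 3, where `sys.hash_info.width` gives the bit width of hash values and so plays the role of HASH_MAX; `builtin_hash_to_unit` is a hypothetical helper name. Note that on Python 3.3 and later, string hashes are randomized per process unless `PYTHONHASHSEED` is set, so the values below are not reproducible between runs:

```python
import sys

HASH_BITS = sys.hash_info.width  # typically 64 on 64-bit builds

def builtin_hash_to_unit(s):
    """Scale the built-in hash of `s` to a float in [0, 1)."""
    # hash() returns a signed integer; mask it to an unsigned value.
    h = hash(s) & ((1 << HASH_BITS) - 1)
    if HASH_BITS > 53:
        # A float has 53 bits of precision; keep only the top bits
        # so the result cannot round up to 1.0.
        return (h >> (HASH_BITS - 53)) / 2**53
    return h / 2**HASH_BITS

for s in ("aaa", "aab", "aac"):
    print(s, builtin_hash_to_unit(s))
```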

Some relevant information about the problem

I don't know exactly which strings this algorithm will have to handle, because the strings will be generated by the simulation, but the following are likely to hold:

  • They will have a very limited character set (maybe just 4 or 5 different characters).

  • There will be many unique or rare strings and some very common strings of varying length.

  • There is no upper bound on string length, but short strings are likely to be much more common than long ones. I wouldn't be surprised if I never see one over 100 characters, but I don't know for sure. Many of them will have only one to three characters, so it is important that the algorithm is fast for short strings. (But I think I could use a lookup table for strings below a certain length.)

  • Strings will usually share large substrings: often two strings will differ by only one character added to the beginning or end. It is important that the algorithm does not produce similar outputs when the strings are similar.

+3



4 answers


Use a good random number generator and seed it with the string.
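A minimal sketch of this suggestion using the standard `random` module, assuming Python 3, where seeding `random.Random` with a `str` is deterministic across runs. The function seed is combined with the input string to get several independent functions; `make_random_function` and the `":"` separator are illustrative choices, not part of the answer:

```python
import random

def make_random_function(function_seed):
    """Return a function mapping str -> float in [0, 1)."""
    def f(s):
        # A fresh generator is seeded for every call, so the result
        # depends only on (function_seed, s), not on call order.
        return random.Random(f"{function_seed}:{s}").random()
    return f

func_a = make_random_function("A")
func_b = make_random_function("B")
print(func_a("abc"), func_b("abc"))
```

Constructing and seeding a generator on every call is comparatively expensive, so this may be too slow for short strings in a tight simulation loop.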



+3



The Wikipedia article on universal hashing contains an algorithm in its "Hashing strings" section.



Alternatively, you can simply use the built-in hash function; each of your random functions adds a random (but fixed) prefix to the string before hashing.
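A rough sketch loosely following the string-hashing scheme from that article: evaluate the string as a polynomial in a random multiplier modulo a prime, then apply a final affine step. The prime 2**61 - 1, the `+ 1` offset on character codes, and the name `make_universal_hash` are assumptions made here for illustration. Universal hashing bounds the collision probability; it does not by itself promise that outputs look statistically random:

```python
import random

P = (1 << 61) - 1  # the Mersenne prime 2**61 - 1

def make_universal_hash(seed):
    """Return a function mapping str -> float in [0, 1)."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)   # multiplier for the polynomial
    a2 = rng.randrange(1, P)  # final affine step: a2 * h + b2
    b2 = rng.randrange(P)

    def f(s):
        h = 0
        for ch in s:
            # Horner evaluation; +1 so that no character maps to 0.
            h = (h * a + ord(ch) + 1) % P
        x = (a2 * h + b2) % P
        # x has at most 61 bits; keep the top 53 for the float.
        return (x >> 8) / 2**53

    return f

func_a = make_universal_hash(1)
func_b = make_universal_hash(2)
print(func_a("abc"), func_b("abc"))
```

Each seed picks a different member of the hash family, which gives the multiple functions asked for in the question.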

+1



Lookup3 is believed to have very good collision properties, which should imply an even distribution of results, and it is fast as well. It should be easy to wrap it in a Python extension.

More generally, if you find a function that does a good job of minimizing hash table collisions and has the necessary speed, all that is left is a final 32- or 64-bit integer to float conversion. There are many sources on the internet and elsewhere on hashing strings. Check out Knuth for a start.
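The final integer-to-float step might look like this; `uint_to_unit` is a hypothetical helper, and the input is assumed to be an unsigned integer of the stated width:

```python
def uint_to_unit(value, bits):
    """Map an unsigned `bits`-wide integer to a float in [0, 1)."""
    if bits > 53:
        # A float holds 53 significant bits; drop the low bits so
        # the largest input cannot round up to exactly 1.0.
        value >>= bits - 53
        bits = 53
    return value / (1 << bits)

print(uint_to_unit(0x80000000, 32))          # 0.5
print(uint_to_unit(0xFFFFFFFFFFFFFFFF, 64))  # largest input, still below 1.0
```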

Addendum

Another thing to try is to first encrypt the string with a fast one-to-one algorithm like RC4 (not secure, but still close enough to pseudo-random), and then run a trivial hash (h = h + a * c[i] + b) over the ciphertext. Each function gets its own RC4 key.
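A pure-Python sketch of this idea, for illustration only. The constants `a` and `b`, the 32-bit modulus, and the name `make_rc4_function` are assumptions, not part of the suggestion. Because RC4 is a stream cipher, the ciphertext is the input XORed with a key-dependent keystream, so two inputs that differ in one byte give ciphertexts that differ in that same byte; how well similar strings are separated therefore rests largely on the final hash:

```python
def rc4(key, data):
    """Encrypt `data` (bytes) with RC4 under `key` (bytes)."""
    # Key-scheduling algorithm.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation, XORed with the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

def make_rc4_function(key, a=2654435761, b=12345):
    """Return a function mapping str -> float in [0, 1)."""
    def f(s):
        h = 0
        for c in rc4(key, s.encode("utf-8")):
            h = (h + a * c + b) % 2**32  # the trivial hash
        return h / 2**32
    return f

func_a = make_rc4_function(b"key for A")
func_b = make_rc4_function(b"key for B")
print(func_a("abc"), func_b("abc"))
```

The key schedule is rerun on every call here, which dominates the cost for short strings; caching the scheduled state per key would be an obvious refinement.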

+1



Try using a fingerprint such as the Rabin fingerprint:
http://en.wikipedia.org/wiki/Fingerprint_(computing)

If you choose an N-bit fingerprint, you just need to divide the result by 2^N.

Fingerprints are a kind of hash function that is generally very fast to compute (compared to cryptographic hash functions like MD5), but they are not suitable for cryptographic applications (the key may be recoverable from its fingerprint).

+1

