How to tell if a string is random in C / C ++

I am working on some encryptions creating an encrypted string like t7AZChkiBA

or?t7AZDxknCE34F14OuwuXiIcGBIKqSGya03CY9cf9iUSPuCb7owPUzmfTxoBsDzE533S82dvKqm7KmOkREtknHH30z6rLHAHg29COKjX9A6uZxh4fAlrRy

The length is not fixed. How can I find if this string is random and doesn't mean anything?
I don't have a dictionary. I'm trying to find some statistical correlation, but I don't know how.

+3


source to share


4 answers


I think this website, which has an online Shannon entropy calculator for an arbitrary string, a formula, and a pretty good explanation, will help:

http://www.shannonentropy.netmark.pl/



From this calculator, what you are looking for is the "metric entropy" which is equal to Shannon's entropy divided by the length of the string, which is a measure of the randomness of your string. It can take values ​​from 0 to 1, where 1 means that the string is evenly distributed randomly.

+1


source


The string is not a random bit string. It seems to be composed entirely of characters in some kind of alphabet. Symbols can be part of any fully randomized set of inputs.

To really check for randomness, you need to translate your ciphertext into a bit string. Then run one of the test applications defined by NIST or the German BSI to test for randomness and use the bit string as input.

To determine that this is not an accident, you can run a frequency analysis or determine if the Hamming distance differs from 0.5 in the ciphertext. If I take a good look at your random text, it is very likely that one of these tests will fail. While there are many other tests out there, you only need one failed test to show that it is not random.



Of course, since any bit string is equally likely to be random text, you can only show with some level of confidence that it is not random.

It is also quite possible to cheat random numerical tests. The fact that the ciphertext passes these tests does not mean that the cipher completely attacks targeted attacks.

+1


source


One technique - which would work better with longer strings - would be to generate a very large set of random sample rows and then calculate some statistics on them to get an idea of ​​the mean and standard deviation for the random input, which would allow you to get a raw percentage. the probability that any string entered will not be random. A combination of such tests - each using different statistics - should give you a more accurate test.

As far as statistics are concerned, this may slightly depend on what kind of non-random input you expect (for example, whether you need to "protect" from user non-standard input intended to trick your program):

  • average "distance" (subtracting ASCII values) between adjacent letters

  • number of samples in different ranges (for example, split A-Za-z0-9 into 10 ranges and see if the # character in each range matches, as it should for random inputs)

  • counts how often any subsequence is repeated later in the string, possibly the other way around

  • the number of words of the dictionary, possibly with a minimum length to avoid noise

  • checking if uppercase letters, lowercase consonants, lowercase vowels, numbers will display roughly proportional to the number of such characters in the input format (for example, if you have 52 characters and 10 digits = 62 possible characters of the value, you would expect lowercase vowels are 5/62 of the line length on average, and could calculate the standard deviation to tell you how significant the higher / lower value was.

  • checking how often specific bits are set in incoming characters.

0


source


One way to separate the random from the non-random is to try to compress the string. A non-random string will compress more than a random string. Of course, the problem with a string generated with a good encryption method is that any such method is designed to output strings that appear to be random, and therefore will pass a lot of tests for randomness. Even decrypting with the wrong key will still give an accidental exit.

0


source







All Articles