Why is ZIP so efficient in System.Random generated sequences - where is Kolmogorov's complexity?
I am generating sequences of random numbers. The sequences only include 0s and 1. I put each sequence in a separate text file and then try to archive the file (in .zip format). I am using System.Random to generate the elements of each sequence. At first glance, the sequences appear to be truly random.
The strange thing is that regardless of the size of the generated .txt file, the size of the compressed .zip file is always the same ratio of ~ 17% of the size of the .txt file.
But in theory, for a truly random sequence, the compressed .zip file should be almost equal in size to the .txt file - that is, there should be almost no compression. Otherwise, the sequence is at least partially predictable (which is not possible in this flip experiment).
So my "archiver" knows how to recognize that the sequence is being generated by a specific pseudo-random generator implemented in System.Random.
I have 2 questions here:
-
how to create a pseudo-random sequence that will not be compressed by the archiver? Maybe there are some famous tricks?
-
why is the 17% ratio so stable and independent of the sequence length (i.e. the size of the .txt file).
Thanks for any help!
source to share
You are stating that you only store 0s and 1s in a text file. Thus, your file binary level consists entirely of occurrences of bit-sequences 00110000
, and 00110001
(which is the ASCII value for the character '0'
and '1'
). This is very wasteful and a good compression algorithm will realize that it can represent each of these 8-bit patterns with fewer bits: 1 is optimal, but probably a combination of 1 and 2 bits to get the 18% compression ratio you cite.
If you want to create a sequence that cannot be compressed, you need to create arbitrary unlimited values and write them as binaries. For example:
byte[] buffer = new byte[1024 * 1024]; // for a 1 MB file
(new Random()).NextBytes(buffer); // each byte gets a random value from 0 to 255
File.WriteAllBytes(target, buffer);
source to share