Compressing a sequence of unique sorted numbers

In a project I am working on, I have a sequence of numbers (about 2 billion of them). Each number is 4 bytes and unique, and the numbers are sorted. My goal is to read them into RAM as quickly as possible, in uncompressed form. Hard drive space is not a concern.

If I store them uncompressed, I need 2 billion * 4 bytes = 8 GB, which takes about 100 seconds to read. I can instead store the data as a sequence of bits, which takes 2 billion bits / 8 = 250 MB and about 3 seconds to read.

I need to read and decompress them in about 0.1-0.5 seconds (if possible) from a regular hard drive. I don't care how long it takes to compress the data, but I do care how long it takes to decompress it; decompression needs to finish within a few hundred milliseconds.

The randomness of the numbers is unknown.

Question: What compression algorithm can compress the numbers down to about 20-30 MB with a decompression time of 100-200 milliseconds on an i3/i5-class processor?

EDIT: The maximum number in the sequence is 2 billion, so I can store it in a 250 MB bit array. The sequence does not always contain 2 billion numbers; it can hold anywhere from 1 to 2,000,000,000 of them.
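For reference, my size arithmetic as a rough sketch (assuming the worst case of 2 billion stored values and a maximum value of 2,000,000,000):

```python
# Rough size arithmetic for the two raw layouts.
count = 2_000_000_000        # how many numbers are stored (worst case)
max_value = 2_000_000_000    # largest value that can appear

raw_bytes = count * 4              # 4-byte integers stored verbatim
bit_array_bytes = max_value // 8   # one bit per possible value up to max_value

print(raw_bytes / 1e9)        # ~8.0 GB
print(bit_array_bytes / 1e6)  # ~250 MB
```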



2 answers


There are two possible approaches:



  • The asker proposes storing the sequence of numbers as a bit string: if the number i is in the sequence, then the i-th bit of the bit string is set to one, otherwise it is zero. A natural first step is to apply standard compression algorithms to this bit string and see what happens (see the first sketch after this list).

  • From the wording of the question, it seems the numbers in the sequence can be treated as 4-byte ints. The stored sequence then contains about 2×10^9 out of a possible 2^32 ints, which means the average difference between consecutive numbers cannot exceed ~2.147 = 2^32 / (2×10^9). It may therefore be worth computing the sequence of differences and trying to compress that (see the second sketch after this list). Since I would expect most of the consecutive differences to be 1 or 2, I suspect the difference sequence could be very compressible.
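A minimal sketch of the first idea (Python and zlib are used purely for illustration; a real implementation over 2 billion values would need a much faster bit-setting loop):

```python
import zlib

def to_bit_string(numbers, max_value=2_000_000_000):
    """Build a bit string with bit i set for every number i in the sequence."""
    bits = bytearray(max_value // 8 + 1)
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)
    return bytes(bits)

def compress_bit_string(numbers):
    """Apply a standard compressor (here zlib) to the bit string."""
    return zlib.compress(to_bit_string(numbers), level=9)
```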
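And a sketch of the second idea, the difference transform; how well the resulting stream compresses depends entirely on how the gaps are distributed:

```python
def deltas(numbers):
    """Differences between consecutive sorted values (the first value is kept as-is)."""
    prev = 0
    out = []
    for n in numbers:
        out.append(n - prev)
        prev = n
    return out

# Example: [3, 5, 6, 10] -> [3, 2, 1, 4]; the original is recovered by a running sum.
```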



Your approach of storing the set as a sequence of bits will work about as well as anything, but it would take 512 MiB to have one bit for every possible four-byte integer, not 250 MB.

A delta encoding scheme would work better for a less dense set, but not for this one (as described in the original question: a random selection of half of the possible 32-bit integers). Here a delta of 1 occurs about half the time, a delta of 2 about a quarter of the time, and so on, so an optimal code spends about k bits on each delta of k. That works out to 2^30 + 2×2^29 + 3×2^28 + ... = 2^32 bits, the same as the bit-vector approach.
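That sum can be checked numerically; a quick sketch (assuming 2^31 gaps, with a gap of k occurring about 2^(31-k) times and costing about k bits):

```python
# Expected total size of delta-coding a random, half-dense set of 32-bit integers.
# A gap of k occurs about 2^(31-k) times and costs about k bits, so the total is
# the sum over k of k * 2^(31-k), which is essentially 2^32 bits.
total_bits = sum(k * 2 ** (31 - k) for k in range(1, 32))
print(total_bits)  # 4294967263, essentially 2^32 = 4294967296
```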

The optimal compression scheme would take log base 2 of (2^32 choose 2^31) bits. That also turns out to be about 2^32 bits. (Actually it is 2^32 minus about 16 bits, so a whopping 16 bits out of four billion could be saved.)
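That figure can be verified via the log-gamma function (a sketch):

```python
from math import lgamma, log

def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k), computed via log-gamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

saved = 2 ** 32 - log2_binomial(2 ** 32, 2 ** 31)
print(saved)  # roughly 16 bits saved out of about 4.3 billion
```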

So the bit vector is about as good as it gets.



The updated question is completely different. Now the set can contain anywhere from one to 2^31 31-bit integers, and the question asks how to compress that down to 20 MiB to 30 MiB.

These compressed sizes put a limit on the size of the set. Given the size of the set, call it n, one can simply count the number of possible subsets of 31-bit integers of that size. That number of possible subsets is (2^31 choose n), where "choose" denotes the binomial coefficient. The log base 2 of this number of possible subsets is the theoretical minimum compressed size, in bits, of a particular subset of that size, assuming all such subsets are equally likely.

So now we can calculate the maximum set size that can be compressed into 20 MiB to 30 MiB. That works out to about 21 to 34 million values. Subsets of size 2^31 minus 21-to-34 million can also be compressed that far, since they can be identified by the values they are missing rather than by the values they contain. Anything in between will take more than 30 MiB to represent under a theoretically optimal compression scheme. The updated question allows the full range of possible subset sizes, the vast majority of which fall between 34 million and about 2.1 billion.
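The 21-to-34-million figure can be reproduced numerically; a sketch that searches for the largest set size whose theoretical minimum fits a given budget (the bisection bounds and the reuse of log2_binomial from the sketch above are my own choices):

```python
from math import lgamma, log

def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k), computed via log-gamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def max_set_size(budget_bits, universe=2 ** 31):
    """Largest set size whose theoretically optimal encoding fits in budget_bits."""
    lo, hi = 1, universe // 2
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if log2_binomial(universe, mid) <= budget_bits:
            lo = mid
        else:
            hi = mid - 1
    return lo

print(max_set_size(20 * 2 ** 20 * 8))  # about 21 million values fit in 20 MiB
print(max_set_size(30 * 2 ** 20 * 8))  # about 34 million values fit in 30 MiB
```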

So, the bottom line is that it is impossible to compress the described sequences anywhere near as much as the updated question asks for.







