Compressing integers

How can I compress a string of integers into something shorter?

eg. Input: '1 2 4 9 8 5 2 7 6 2 3 4' → Algorithm → Output: 'XYZ'

and can return it the other way around? ('XYZ' → '1 2 4 9 8 5 2 7 6 2 3 4')

The entry contains only 12 digits, only numbers. The output can be alphanumeric and must be no more than 3-4 digits.

Thanks in advance.

Edit: each input digit is 0-9; Output 0-9a-Z

+3


source to share


2 answers


First, you can use any existing compression algorithm through some library. However, knowing that your input is very specialized, you can also write a custom algorithm tailored to your case.

But first analyze how much you can compress this input. To keep it simple, I'll first look at compressing exactly 12 digits from 0 to 9 (you didn't explicitly specify which range of input data). There are 10 ^ 12 possible combinations, which is a little less than 2 ^ 40. So what you basically want to do is compress 40 bits.

Now let's analyze how you can compress those 40 bits. If you understand alphanumeric as [0-9A-Z]

, you have 36 characters. Each character can encode log_2 (36) = 5.1 bits. So it takes 8 alphanumeric characters to encode your 40 bits.

A better alternative would be to use base64. Here you have 64 characters, which means each character can encode 6 bits so that you can encode your input with only 40/6 = 6.666 => 7 characters.

If you think that compressing your input to binary, you obviously need 40 bits. This can be written with 5 8-bit ASCII characters with 2 32-bit integers, or 1 64-bit integer. However, this is probably not what you want to achieve.

Conclusion: you cannot compress data arbitrarily, and the data you want to compress cannot be compressed as much as you want to compress.




As an example, to encode 12 digits 0 through 9 to ASCII characters, you can simply convert them to one large number, convert them to binary, then take that binary number in 8 bit chunks and convert them to ASCII characters.

Example:

Input: 1 2 4 9 8 5 2 7 6 2 3 4
One number: 124985276234
Binary: 1110100011001101100111111011101001010
Grouped: 11101 00011001 10110011 11110111 01001010
ASCII: <GS><EM>  J

      

Please note that some ASCII characters cannot be printed. If this is important to you, you will have to use an alternative encoding, such as base 64, which only contains 64 different characters, but they are all printable.

+4


source


If your input comes from a specific domain, where many inputs are unlikely / unacceptable - you cannot do this.

You can encode 62^4~=1.4*10^7

different series with 4 alphanumeric characters. On the other hand, a 12 digit input can have 10 ^ 12 possible different inputs.

From the principle of the pigment hole - there should be 2 "compression", which are mapped to the same input.

But, since you will need to recreate the original sequence, you cannot tell the difference between two that are the same compressions.



So there is no such compression.

In fact, to compress a 12-digit number into 4 characters, you need an alphabet for 1000 characters:

x^4 = 10^12, x>0
x = 1000

      

+8


source







All Articles