Does the order of data in a text file affect the compression ratio?

I have 2 large text files (CSV, to be precise). Both have the same content; the only difference is that the lines appear in a different order in each file.

When I compress these 2 files (programmatically, using DotNetZip), I notice that one of the files always comes out significantly larger - in one case, 7 MB larger than the other.

My questions:

How does the order of the data in a text file affect compression, and what steps can be taken to ensure the best compression ratio? I would guess that grouping similar lines together (at least in the case of the ZIP files I use) helps compression, but I am not familiar with the internals of the various compression algorithms and would appreciate a quick explanation on the matter ...

Which algorithm handles this scenario better in the sense that it achieves the best average compression regardless of the order of the data?

+3




4 answers


The "how" has already been answered. To answer your "what" question:

The larger the window for matching, the less sensitive the algorithm will be to order. However, all compression algorithms will be sensitive to order to some degree.



gzip has a 32 KB window, bzip2 has a 900 KB window, and xz defaults to an 8 MB window and can go up to a 64 MB window. So xz will be the least order-sensitive. More distant matches require more bits to encode, so you can always improve compression, for example with sorted entries, regardless of the window size. Short windows simply rule out distant matches.
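The window-size effect is easy to observe with Python's standard-library compressors (the question uses DotNetZip, but the effect is library-independent). The sketch below repeats a 100 KB block of random bytes: the second copy starts 100 KB after the first, well outside DEFLATE's 32 KB window but inside lzma's much larger dictionary. The exact output sizes will vary run to run; only the rough magnitudes matter.

```python
import os
import zlib
import lzma

# A 100 KB block of random (incompressible) bytes, repeated twice.
# The repeat begins 100 KB back - beyond zlib/DEFLATE's 32 KB window,
# but well within lzma/xz's default dictionary.
block = os.urandom(100_000)
data = block + block

deflate_size = len(zlib.compress(data, 9))  # cannot see the first copy
lzma_size = len(lzma.compress(data))        # encodes the repeat as one long match

print(deflate_size)  # roughly 200 KB: the duplicate goes unnoticed
print(lzma_size)     # roughly 100 KB: only one copy is actually stored
```

The same data compresses to about half the size under the large-window algorithm, purely because the match falls inside its window.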

+9




In a sense, it is the entropy of a file that determines how well it will compress. So, yes, order definitely matters. As a simple example, consider a file filled with the sequence abcdefgh...zabcd...z repeated over and over. It will compress very well with most algorithms because it is very orderly. However, if you completely shuffle the file (while keeping the same count of each letter), it contains exactly the same data (although it has a different "meaning"). It is the same data in a different order, and it will not compress nearly as well.



In fact, because I was curious, I just tried this. I filled an array with 100,000 characters a-z, repeating, wrote it to a file, then shuffled the array randomly and wrote it again. The first file compressed to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
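The experiment above is easy to reproduce. This sketch uses Python's zlib (a DEFLATE implementation, the same family as ZIP) rather than whatever the answerer used, so the byte counts won't match exactly, but the ratio between the two results is just as dramatic.

```python
import random
import zlib

# 100,000 characters cycling through a-z, then the same characters shuffled.
letters = "abcdefghijklmnopqrstuvwxyz"
ordered = (letters * (100_000 // 26 + 1))[:100_000].encode("ascii")

shuffled = list(ordered)
random.shuffle(shuffled)      # identical character counts, random order
shuffled = bytes(shuffled)

print(len(zlib.compress(ordered, 9)))   # a few hundred bytes: fully predictable
print(len(zlib.compress(shuffled, 9)))  # tens of KB: the order carries the cost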

+7




A typical compression algorithm works as follows: look at a chunk of data, and if it is identical to some other, recently seen chunk, emit a reference to that earlier chunk instead of writing the current chunk out literally.

It certainly helps when similar chunks are close together. The algorithm only keeps a limited amount of history, to keep compression fast and memory bounded. So even if a chunk of data is identical to some earlier chunk, if that earlier chunk is too old, it may already have been discarded.
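The "reference to an earlier chunk" idea can be sketched as a toy LZ77-style codec. This is purely illustrative (real implementations use hash chains and entropy-code the output, and the `toy_lz`/`toy_unlz` names are made up for this sketch), but it shows both the back-references and the sliding window that causes old matches to be forgotten.

```python
def toy_lz(data: bytes, window: int = 32):
    """Toy LZ77-style encoder: emit ("match", distance, length) for a repeat
    found within the sliding window, else ("lit", byte). Illustrative only."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # only look back `window` bytes
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:        # a reference only pays off if the match is long
            out.append(("match", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def toy_unlz(tokens) -> bytes:
    """Decoder: replay literals and copy matches from already-decoded output."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])   # overlapping copies are allowed
    return bytes(out)

tokens = toy_lz(b"abcabcabc")
print(tokens)                 # three literals, then one back-reference
print(toy_unlz(tokens))       # round-trips to the original bytes
```

With a small `window`, a repeat that occurs too far back simply never produces a match token - the toy version of "this old chunk has already been deleted".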

+3




Sure. If the input pattern is fixed, the probability of predicting the character at each position is 100%. Given that both parties know their data stream follows a fixed pattern, there is almost nothing to communicate: essentially total compression is possible (to transmit strings of finite length rather than unbounded streams, you would still need to encode the length, but that is beside the point). If the other party doesn't know the pattern, all you have to do is transmit the pattern once. Total compression is possible because you can encode an unbounded stream with a finite amount of data.

On the other hand, if you have completely random data - the stream can be anything, and the next character can always be any valid character - then no compression is possible. The entire stream must be transmitted for the other side to be able to reconstruct it.

Finite strings are a little trickier. Since a finite string necessarily contains a fixed number of instances of each character, the probabilities must shift once you start reading off the initial tokens. You can read some sort of order into any finite string.

Not sure whether this answers your question, but it covers the theory behind it.

0



