Compressing multiple compressed zlib data streams into one stream efficiently

If I have multiple zlib compressed binary strings, is there a way to efficiently concatenate them into one compressed string without unpacking everything?

An example of what I need to do now:

import zlib

c1 = zlib.compress(b"The quick brown fox jumped over the lazy dog. ")
c2 = zlib.compress(b"We ride at dawn! ")
c = zlib.compress(zlib.decompress(c1) + zlib.decompress(c2))  # Warning: Inefficient!

d1 = zlib.decompress(c1)
d2 = zlib.decompress(c2)
d = zlib.decompress(c)

assert d1 + d2 == d  # This will pass!

An example of what I want:

c1 = zlib.compress(b"The quick brown fox jumped over the lazy dog. ")
c2 = zlib.compress(b"We ride at dawn! ")
c = magic_zlib_add(c1+c2)  # Magical method of combining compressed streams

d1 = zlib.decompress(c1)
d2 = zlib.decompress(c2)
d = zlib.decompress(c)

assert d1 + d2 == d  # This should pass!

I don't know much about zlib and the DEFLATE algorithm, so this might be completely impossible from a theoretical point of view. Also, the result has to be a plain zlib stream, so I cannot wrap zlib in a custom protocol that handles concatenated streams transparently.

NOTE: I don't mind if the solution is not trivial in Python. I am ready to write C code and call it from Python with ctypes.

2 answers


Since you don't mind getting involved in C, you can start with the gzjoin code (gzjoin.c, in the examples directory of the zlib source distribution).

Note that gzjoin has to decompress the input to find the parts that must change when the streams are merged, but it does not have to recompress anything. That is not too bad, because decompression is usually faster than compression.


In addition to gzjoin, which requires decompressing the first deflate stream, you can take a look at gzlog.h and gzlog.c, which efficiently append short strings to a gzip file without having to decompress the deflate stream every time. (They can easily be modified to operate on zlib-wrapped deflate data instead of gzip-wrapped deflate data.) You would use this approach if you control the creation of the first deflate stream. If you are not creating the first deflate stream, you will have to use the gzjoin approach, which requires decompression.

Neither approach requires recompression.
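
If you do control how the pieces are created, the idea can be tried out in pure Python. The sketch below is not what gzlog does internally; it is a simpler variant that assumes every piece is compressed as raw deflate and ended with Z_SYNC_FLUSH, so that it is byte-aligned and has no final block. The names make_piece, adler32_combine and join_pieces are made up for this example. The C library has its own adler32_combine(), but Python's zlib module does not expose it, so the checksum arithmetic is written out by hand here.

```python
import struct
import zlib

BASE = 65521  # Adler-32 modulus (largest prime below 2**16)


def make_piece(data, level=6):
    """Compress data into a byte-aligned raw deflate fragment with no final block.

    Returns (fragment, Adler-32 of data, len(data)); the last two are needed
    later to build the trailer without touching the uncompressed data again.
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # negative wbits: raw deflate
    fragment = comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)
    return fragment, zlib.adler32(data) & 0xFFFFFFFF, len(data)


def adler32_combine(adler1, adler2, len2):
    """Adler-32 of seq1 + seq2, given each checksum and the length of seq2."""
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    a = (a1 + a2 - 1) % BASE
    b = (b1 + b2 + len2 * (a1 - 1)) % BASE
    return (b << 16) | a


def join_pieces(pieces):
    """Wrap fragments from make_piece into one zlib stream, without recompressing."""
    body = b"".join(fragment for fragment, _, _ in pieces)
    adler = 1  # Adler-32 of the empty string
    for _, piece_adler, piece_len in pieces:
        adler = adler32_combine(adler, piece_adler, piece_len)
    header = b"\x78\x9c"       # zlib header: deflate, 32 KiB window
    final_block = b"\x03\x00"  # empty fixed-Huffman block with the final bit set
    return header + body + final_block + struct.pack(">I", adler)


p1 = make_piece(b"The quick brown fox jumped over the lazy dog. ")
p2 = make_piece(b"We ride at dawn! ")
c = join_pieces([p1, p2])

# An ordinary zlib.decompress() reads the joined stream.
assert zlib.decompress(c) == (
    b"The quick brown fox jumped over the lazy dog. We ride at dawn! "
)
```

The price is a slightly worse compression ratio: each piece is compressed on its own, and every sync flush adds a few bytes. This does not help with streams that were already produced by zlib.compress(); for those, the gzjoin approach still applies.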
