Combining multiple zlib-compressed data streams into one stream efficiently
If I have multiple zlib compressed binary strings, is there a way to efficiently concatenate them into one compressed string without unpacking everything?
An example of what I need to do now:
import zlib
c1 = zlib.compress(b"The quick brown fox jumped over the lazy dog. ")
c2 = zlib.compress(b"We ride at dawn! ")
c = zlib.compress(zlib.decompress(c1)+zlib.decompress(c2)) # Warning: Inefficient!
d1 = zlib.decompress(c1)
d2 = zlib.decompress(c2)
d = zlib.decompress(c)
assert d1+d2 == d # This will pass!
An example of what I want:
c1 = zlib.compress(b"The quick brown fox jumped over the lazy dog. ")
c2 = zlib.compress(b"We ride at dawn! ")
c = magic_zlib_add(c1+c2) # Magical method of combining compressed streams
d1 = zlib.decompress(c1)
d2 = zlib.decompress(c2)
d = zlib.decompress(c)
assert d1+d2 == d # This should pass!
I don't know much about zlib or the DEFLATE algorithm, so this might be completely impossible from a theoretical point of view. Also, I have to use zlib itself, so I cannot wrap zlib in a custom protocol that handles concatenated streams transparently.
NOTE: I don't mind if the solution is not trivial in Python. I am ready to write C code and call it from Python through ctypes.
In addition to gzjoin, which requires decompressing the first deflate stream, you can take a look at gzlog.h and gzlog.c (both in the examples directory of the zlib distribution), which efficiently append short strings to a gzip file without having to decompress the deflate stream each time. (The code can easily be modified to operate on zlib-wrapped deflate data instead of gzip-wrapped deflate data.) You would use this approach if you control the creation of the first deflate stream. If you are not creating the first deflate stream, you will have to use the gzjoin approach, which requires decompression.
Neither approach requires recompression.
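For the case where you control how every piece is compressed, here is a minimal Python sketch of a related idea. It is not gzlog itself, only a simpler technique in the same spirit: each piece is compressed as raw deflate and ended with a sync flush, so it finishes on a byte boundary without a final-block marker. Such pieces can then be concatenated, closed with an empty final block, and wrapped in a zlib header and Adler-32 trailer. The names make_piece, adler32_combine and join_pieces are hypothetical helpers written for this illustration. Python's zlib module does not expose zlib's C function adler32_combine, so the sketch reimplements the arithmetic.

```python
import struct
import zlib

BASE = 65521  # Adler-32 modulus (largest prime below 2**16)


def make_piece(data, level=6):
    """Compress data as raw deflate that ends byte-aligned and non-final.

    Returns (deflate_bytes, adler32_of_data, len_of_data). Hypothetical helper.
    """
    # wbits=-15 selects raw deflate: no zlib header, no Adler-32 trailer.
    co = zlib.compressobj(level, zlib.DEFLATED, -15)
    # Z_SYNC_FLUSH pads to a byte boundary with an empty stored block
    # and does not mark the stream as finished.
    body = co.compress(data) + co.flush(zlib.Z_SYNC_FLUSH)
    return body, zlib.adler32(data), len(data)


def adler32_combine(adler1, adler2, len2):
    """Adler-32 of A+B, given Adler-32 of A, Adler-32 of B, and len(B)."""
    a1, b1 = adler1 & 0xFFFF, (adler1 >> 16) & 0xFFFF
    a2, b2 = adler2 & 0xFFFF, (adler2 >> 16) & 0xFFFF
    a = (a1 + a2 - 1) % BASE
    b = (b1 + b2 + (len2 % BASE) * (a1 - 1)) % BASE
    return (b << 16) | a


def join_pieces(pieces):
    """Join pieces from make_piece into one zlib stream, no recompression."""
    out = [b"\x78\x9c"]  # zlib header: deflate, 32K window
    adler = 1            # Adler-32 of the empty string
    for body, piece_adler, length in pieces:
        out.append(body)
        adler = adler32_combine(adler, piece_adler, length)
    out.append(b"\x03\x00")              # empty final fixed-Huffman block
    out.append(struct.pack(">I", adler))  # big-endian Adler-32 trailer
    return b"".join(out)


d1 = b"The quick brown fox jumped over the lazy dog. "
d2 = b"We ride at dawn! "
c = join_pieces([make_piece(d1), make_piece(d2)])
assert zlib.decompress(c) == d1 + d2
```

The trade-off is that each piece is compressed independently, so later pieces cannot reference earlier data and the result may be somewhat larger than compressing everything in one pass. Streams that were already produced with plain zlib.compress do not fit this sketch, because their last block is marked final; that is the case where the gzjoin approach applies.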