Re-gzip files in python

I am writing a Python script to deploy static sites to AWS (S3, CloudFront, Route 53). Since I don't want to re-upload every file on every deploy, I check which files have changed by comparing their MD5 hash to their ETag (which S3 sets to the MD5 of the object, at least for single-part uploads). This works well for every file except the ones my build script gzips before uploading. Looking inside those files, gzip does not appear to be a pure function: the output differs slightly every time gzip runs, even when the original file has not changed.

My question is this: is there a way to get gzip to reliably produce the same output for the same input? Or am I better off detecting whether a file was gzipped, decompressing it, computing the MD5 of the original content myself, and comparing against that instead of the ETag?
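
For context, a minimal sketch of the change check described above, assuming boto3 and single-part uploads (bucket, key, and function names are just illustrative):

import hashlib
import boto3

s3 = boto3.client("s3")

def file_changed(local_path, bucket, key):
    # MD5 of the local file as it would be uploaded.
    with open(local_path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    # For single-part uploads the ETag is the MD5 digest,
    # wrapped in double quotes by S3.
    head = s3.head_object(Bucket=bucket, Key=key)
    return head["ETag"].strip('"') != local_md5

This breaks down exactly as described when the uploaded bytes are a fresh gzip of unchanged content, because the compressed bytes (and hence the ETag) differ between runs.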





2 answers


The compressed data is the same every time. What differs is most likely the modification time stored in the gzip header. The fifth argument to GzipFile (if that's what you are using) lets you specify the modification time written into the header. The first argument is the filename, which is also stored in the header, so you will want to keep that consistent as well. If you pass a file object as the fourth argument (fileobj), the first argument is used only to fill in the filename field of the header.
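
A minimal sketch of that approach, pinning both the filename and the mtime fields of the header (the function and path names are just illustrative):

import gzip

def gzip_reproducible(src_path, dst_path):
    # Read the original content.
    with open(src_path, "rb") as f_in:
        data = f_in.read()
    # mtime=0 fixes the timestamp field in the gzip header, and an empty
    # filename keeps the name field constant, so identical input bytes
    # always produce identical output bytes.
    with open(dst_path, "wb") as f_out:
        with gzip.GzipFile(filename="", mode="wb", fileobj=f_out, mtime=0) as gz:
            gz.write(data)

With the header held constant like this, re-gzipping an unchanged file yields the same MD5, and the ETag comparison works again.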







gzip output is not stable, as you observed: the command-line tool stores the original filename and timestamp in the header, so even files with identical contents compress to different bytes:



[root@dev1 ~]# touch a b
[root@dev1 ~]# gzip a
[root@dev1 ~]# gzip b
[root@dev1 ~]# md5sum a.gz b.gz
8674e28eab49306b519ec7cd30128a5c  a.gz
4974585cf2e85113f1464dc9ea45c793  b.gz
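
Given that, one way to sidestep the header entirely is the second option from the question: hash the uncompressed content rather than the .gz file. A rough sketch, assuming the decompressed payload fits in memory (names are illustrative):

import gzip
import hashlib

def content_md5(path):
    # Hash the uncompressed content so the gzip header (filename, mtime)
    # never affects the change check.
    with open(path, "rb") as f:
        data = f.read()
    if path.endswith(".gz"):
        data = gzip.decompress(data)
    return hashlib.md5(data).hexdigest()

Since S3 computes the ETag itself and the client cannot set it, you would store this digest as custom object metadata (an x-amz-meta-* header) at upload time and compare against that on later deploys.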










