Re-gzip files in python

I am writing a Python script to deploy static sites to AWS (S3, CloudFront, Route 53). Since I don't want to re-upload every file on every deploy, I check which files have changed by comparing their MD5 hash to their ETag (which S3 sets to the MD5 of the object, at least for single-part uploads). This works well for every file except the ones my build script gzips before uploading. Looking inside those files, gzip does not appear to be a pure function: the output differs slightly every time gzip runs, even when the original file has not changed.

My question is this: is there a way to get gzip to reliably produce the same output for the same input? Or am I better off detecting whether a file was gzipped, decompressing it, computing the MD5 of the original content myself, and comparing against that instead of the ETag?
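
For context, a minimal sketch of the change check described above, assuming boto3 and single-part uploads (bucket, key, and function names are just illustrative):

import hashlib
import boto3

s3 = boto3.client("s3")

def file_changed(local_path, bucket, key):
    # MD5 of the local file as it would be uploaded.
    with open(local_path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    # For single-part uploads the ETag is the MD5 digest,
    # wrapped in double quotes by S3.
    head = s3.head_object(Bucket=bucket, Key=key)
    return head["ETag"].strip('"') != local_md5

This breaks down exactly as described when the uploaded bytes are a fresh gzip of unchanged content, because the compressed bytes (and hence the ETag) differ between runs.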





2 answers


The compressed data is the same every time. What differs is most likely the modification time stored in the gzip header. The fifth argument to GzipFile (if that's what you are using) lets you specify the modification time written into the header. The first argument is the filename, which is also stored in the header, so you will want to keep that consistent as well. If you pass a file object as the fourth argument (fileobj), the first argument is used only to fill in the filename field of the header.
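
A minimal sketch of that approach, pinning both the filename and the mtime fields of the header (the function and path names are just illustrative):

import gzip

def gzip_reproducible(src_path, dst_path):
    # Read the original content.
    with open(src_path, "rb") as f_in:
        data = f_in.read()
    # mtime=0 fixes the timestamp field in the gzip header, and an empty
    # filename keeps the name field constant, so identical input bytes
    # always produce identical output bytes.
    with open(dst_path, "wb") as f_out:
        with gzip.GzipFile(filename="", mode="wb", fileobj=f_out, mtime=0) as gz:
            gz.write(data)

With the header held constant like this, re-gzipping an unchanged file yields the same MD5, and the ETag comparison works again.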







gzip output is not stable, as you observed: the command-line tool stores the original filename and timestamp in the header, so even files with identical contents compress to different bytes:



[root@dev1 ~]# touch a b
[root@dev1 ~]# gzip a
[root@dev1 ~]# gzip b
[root@dev1 ~]# md5sum a.gz b.gz
8674e28eab49306b519ec7cd30128a5c  a.gz
4974585cf2e85113f1464dc9ea45c793  b.gz
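
Given that, one way to sidestep the header entirely is the second option from the question: hash the uncompressed content rather than the .gz file. A rough sketch, assuming the decompressed payload fits in memory (names are illustrative):

import gzip
import hashlib

def content_md5(path):
    # Hash the uncompressed content so the gzip header (filename, mtime)
    # never affects the change check.
    with open(path, "rb") as f:
        data = f.read()
    if path.endswith(".gz"):
        data = gzip.decompress(data)
    return hashlib.md5(data).hexdigest()

Since S3 computes the ETag itself and the client cannot set it, you would store this digest as custom object metadata (an x-amz-meta-* header) at upload time and compare against that on later deploys.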










