The sum of the zipfile parts is not equal to the file size

TL;DR: I am working on something that reports information about the entries in an archive, including where each entry sits and how much space it occupies. The example below is much smaller than my real case (which involves hundreds of thousands of entries) but illustrates the same issue. My problem is that my archive contains bytes my accounting does not cover (my guess is that this is bookkeeping overhead from the compression format). The sum of the parts of my archive (total compressed size of all entries + the expected gaps between them) is less than the actual size of the archive. How can I account for this hidden service data when validating the archive?

Where I am:

I have a directory containing three files:

  • doc.pdf

  • cat.jpg

  • model.stl

Using a freeware program, I dump them into a zip file: demo.zip

Using Python I can inspect them pretty easily:

import zipfile

info_list = zipfile.ZipFile('demo.zip').infolist()
for i in info_list:
    print(i.orig_filename)
    print(i.compress_size)
    print(i.header_offset)

From this information we can pull out the following numbers.

The total size of demo.zip is 84469 bytes.

Compressed Size:

|---------------------|-----------------|---------------|
|      File           | Compressed Size | Header Offset |
|---------------------|-----------------|---------------|
|         doc.pdf     |       21439     |       0       |
|---------------------|-----------------|---------------|
|         cat.jpg     |       48694     |    21495      |
|---------------------|-----------------|---------------|
|       model.stl     |       13870     |    70232      |
|---------------------|-----------------|---------------|

I know that zipping leaves some space between entries (that is, the difference between the sum of the previous entries' sizes and each entry's header offset). You can calculate this little "gap":

gap = offset - previous_entry_size - previous_entry_offset
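For example, here is a quick sketch of that calculation using the infolist from above (assuming demo.zip as in the example):

import zipfile

# End of the previous entry's compressed data; the first entry starts at 0.
prev_end = 0
for info in zipfile.ZipFile('demo.zip').infolist():
    gap = info.header_offset - prev_end
    print(info.orig_filename, 'gap:', gap)
    prev_end = info.header_offset + info.compress_size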

I can update my table to look like this:

|---------------------|-----------------|---------------|---------------|
|      File           | Compressed Size | Header Offset |     'Gap'     |
|---------------------|-----------------|---------------|---------------|
|         doc.pdf     |       21439     |       0       |       0       |
|---------------------|-----------------|---------------|---------------|
|         cat.jpg     |       48694     |    21495      |       56      |
|---------------------|-----------------|---------------|---------------|
|       model.stl     |       13870     |    70232      |       43      |
|---------------------|-----------------|---------------|---------------|

Cool. So now one would expect the size of demo.zip to equal the sum of all the entry sizes plus their gaps: 84102 in the example above.

But it does not. Obviously zipping needs headers and information about how the archive was built (and how to decompress it), but I am stuck on how to identify this data or learn more about it.

I could just take 84469 - 84102 and call the difference ~magic zip overhead~ = 367 bytes. But that seems less than ideal, because this number is obviously not magic. Is there a way to inspect the underlying zip data that occupies this space?



1 answer


An empty zip file is 22 bytes long, containing only the End of Central Directory record.

In [1]: import zipfile

In [2]: z = zipfile.ZipFile('foo.zip', 'w')

In [3]: z.close()

In [4]: import os

In [5]: os.stat('foo.zip').st_size
Out[5]: 22
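To see for yourself that those 22 bytes really are the End of Central Directory record, you can read the raw bytes and check the signature (a quick sketch using the foo.zip created above):

with open('foo.zip', 'rb') as f:
    data = f.read()

print(len(data))   # 22
print(data[:4])    # b'PK\x05\x06' -- the End of Central Directory signature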




If the zip file is not empty, then for each file there is a central directory file header (at least 46 bytes) and a local file header (at least 30 bytes).

The actual headers vary in length, because the minimums above do not include the filename (and any extra fields), which are part of each header.
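Putting this together, here is a rough sketch of how you could account for every byte yourself by parsing the headers. It assumes a single-disk archive with no data descriptors and no archive comment, and uses the demo.zip from the question:

import struct
import zipfile

path = 'demo.zip'
with open(path, 'rb') as f:
    data = f.read()

accounted = 0

# Each local file header is 30 fixed bytes plus the filename and extra field;
# the filename and extra-field lengths sit at bytes 26-29 of the header.
for info in zipfile.ZipFile(path).infolist():
    fixed = data[info.header_offset:info.header_offset + 30]
    assert fixed[:4] == b'PK\x03\x04'  # local file header signature
    name_len, extra_len = struct.unpack('<HH', fixed[26:30])
    accounted += 30 + name_len + extra_len + info.compress_size

# The End of Central Directory record stores the central directory's total
# size at offset 12; the EOCD itself is 22 bytes plus an optional comment.
eocd = data.rindex(b'PK\x05\x06')
cd_size = struct.unpack('<I', data[eocd + 12:eocd + 16])[0]
accounted += cd_size + (len(data) - eocd)

print('accounted for', accounted, 'of', len(data), 'bytes')

If the two numbers still disagree, the usual suspects are data descriptors written after each entry's compressed data, or an archive comment after the End of Central Directory record.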
