Streaming md5sum of the contents of a large remote tarball

I need to fetch a .tar.gz archive from an HTTP server and compute the MD5 sum of each file it contains. Since the archive is 4.5GB compressed and 12GB unpacked, I would like to do this without touching the hard drive. Of course, I cannot hold everything in RAM either.

I'm trying to use Python for this, but for some reason the tarfile module tries to seek() to the end of the input file descriptor, something you can't do with piped streams. Any ideas?

import tarfile
import hashlib
import subprocess
URL = 'http://myhost/myfile.tar.gz'

url_fh = subprocess.Popen('curl %s | gzip -cd' % URL, shell=True, stdout=subprocess.PIPE)
tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
for tar_info in tar_fh:
    content_fh = tar_fh.extractfile(tar_info)
    print hashlib.md5(content_fh.read()).hexdigest(), tar_info.name
tar_fh.close()

      

The above failed:

Traceback (most recent call last):
  File "gzip_pipe.py", line 13, in <module>
    tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
  File "/algo/algos2dev4/AlgoOne-EC/third-party-apps/python/lib/python2.6/tarfile.py", line 1644, in open
    saved_pos = fileobj.tell()
IOError: [Errno 29] Illegal seek

      



1 answer


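To compute the MD5 sums of all files in a remote archive on the fly, open the archive in stream mode (mode='r|*'). In this mode tarfile reads the archive sequentially as a stream of blocks, detects the compression transparently, and does not need a seekable file object. The same idea should also fix the original subprocess approach: since gzip -cd has already decompressed the data there, pass mode='r|' instead of mode='r'.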



#!/usr/bin/env python
import tarfile
import sys
import hashlib
from contextlib import closing
from functools import partial

try:
    from urllib.request import urlopen
except ImportError: # Python 2
    from urllib2 import urlopen

def md5sum(file, bufsize=1<<15):
    d = hashlib.md5()
    for buf in iter(partial(file.read, bufsize), b''):
        d.update(buf)
    return d.hexdigest()

url = sys.argv[1] # url to download
with closing(urlopen(url)) as r, tarfile.open(fileobj=r, mode='r|*') as archive:
    for member in archive:
        if member.isreg(): # extract only regular files from the archive
            with closing(archive.extractfile(member)) as file:
                print("{name}\t{sum}".format(name=member.name, sum=md5sum(file)))
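
To check the stream-mode behaviour without a network connection, here is a minimal sketch that builds a small .tar.gz in memory and reads it back through an object that only supports read(), much like a pipe or socket. The names ReadOnlyStream, build_sample_archive and md5_members are hypothetical helpers made up for this illustration, and the two sample files are placeholders.

```python
import hashlib
import io
import tarfile
from functools import partial


class ReadOnlyStream(object):
    """Hypothetical stand-in for a pipe or socket: read() only, no seek()/tell()."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def close(self):
        self._buf.close()


def build_sample_archive():
    """Create a tiny gzip-compressed tar archive in memory (placeholder data)."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode='w:gz') as tar:
        for name, payload in [('a.txt', b'hello\n'), ('b.txt', b'world\n')]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return raw.getvalue()


def md5sum(fileobj, bufsize=1 << 15):
    """Hash a file object in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.md5()
    for buf in iter(partial(fileobj.read, bufsize), b''):
        digest.update(buf)
    return digest.hexdigest()


def md5_members(stream):
    """Return (name, md5) pairs for regular files, reading the tar sequentially."""
    results = []
    # 'r|*' = stream mode with transparent compression detection
    with tarfile.open(fileobj=stream, mode='r|*') as archive:
        for member in archive:
            if member.isreg():
                results.append((member.name, md5sum(archive.extractfile(member))))
    return results


if __name__ == '__main__':
    for name, digest in md5_members(ReadOnlyStream(build_sample_archive())):
        print("{name}\t{sum}".format(name=name, sum=digest))
```

Note that in stream mode each member has to be read in the order it is encountered, before moving on to the next one; going back to an earlier member is not possible.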

      
