Saving a file efficiently using a hash in Django

Question

I am working on a Django project. I want the user to be able to upload a file (via a form) and have it saved locally under that user's path, with a custom filename: the file's hash. The only solution I can think of is to use the "upload_to" argument of the FileField. As I understand it, that would mean:

1) Write the file to disk

2) Calculate the hash

3) Return path + hash as the filename

The problem is that there are two write operations, one when saving the file from memory to disk to compute the hash, and the other when saving the file to the specified location.

Is there a way to override FileField's save-to-disk method (or where can I find out what is going on behind the scenes) so that I can save the file under a temporary name and then rename it to its hash, instead of having it written twice?

Thanks.

+3

python django hash

Puscasu emanuel 30 jul. 15 at 18:32

2 answers

Updated answer, valid for at least Django 1.10:

Your instance.my_file_field is an instance of UploadedFile, not a file-like object.
It cannot be opened or closed, only read, possibly in chunks.
Calling read() can unconditionally consume all available physical memory.

In the following example, the instance has a class method "get_image_basedir" because several models use the same function but need a different base directory. I left it in because it is a common pattern. HASH_CHUNK_SIZE is a variable that I set to optimize disk reads (i.e., to match the file system's block size or a multiple of it).

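import hashlib
import os.path

# Assumed value for illustration; tune it to a multiple of the file system block size.
HASH_CHUNK_SIZE = 65536

def get_image_path(instance, filename):
    # Each model supplies its own base directory.
    base = instance.get_image_basedir()
    parts = os.path.splitext(filename)
    ctx = hashlib.sha256()
    if instance.img.multiple_chunks():
        # Large upload: hash it piece by piece to keep memory use bounded.
        for data in instance.img.chunks(HASH_CHUNK_SIZE):
            ctx.update(data)
    else:
        ctx.update(instance.img.read())
    # Keep the original extension, replace the name with the digest.
    return os.path.join(base, ctx.hexdigest() + parts[1])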
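
The chunked read above can be tried outside Django. The following is a minimal sketch, not the answer's own code: sha256_of_stream is a hypothetical helper and the 64 KiB chunk size is an assumed value. It feeds any binary stream to SHA-256 piece by piece, so memory use stays bounded regardless of file size.

```python
import hashlib
import io

HASH_CHUNK_SIZE = 64 * 1024  # assumed value; pick one that suits your storage


def sha256_of_stream(stream, chunk_size=HASH_CHUNK_SIZE):
    """Return the SHA-256 hex digest of a binary stream, read chunk by chunk."""
    ctx = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"".
    for data in iter(lambda: stream.read(chunk_size), b""):
        ctx.update(data)
    return ctx.hexdigest()


# Stand-in for an uploaded file: about 200 kB of in-memory data.
payload = b"x" * 200_000
digest = sha256_of_stream(io.BytesIO(payload))
```

The digest is the same as hashing the whole payload in one call; only the peak memory use differs.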

+3

Melvyn 28 jan. 17 at 5:08

Alex van liew · Accepted Answer · 2015-07-30T20:59:06+0000

The upload_to parameter of FileField accepts a callable, and the string returned from it is joined to your MEDIA_ROOT setting to get the final filename (from the Django documentation):

This may also be a callable, such as a function, which will be called to obtain the upload path, including the filename. This callable must be able to accept two arguments and return a Unix-style path (with forward slashes) to be passed along to the storage system. The two arguments that will be passed are:

instance: an instance of the model where the FileField is defined. More specifically, this is the particular instance to which the current file is being attached. In most cases, this object will not have been saved to the database yet, so if it uses the default AutoField, it might not yet have a value for its primary key field.

filename: the filename that was originally given to the file. This may or may not be taken into account when determining the final destination path.
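
As an illustration of that two-argument contract, here is a minimal sketch of an upload_to callable that can be exercised without Django. The names user_directory_path, owner_id and FakeInstance are hypothetical and exist only for this example; posixpath is used so the result has forward slashes on every platform.

```python
import posixpath


def user_directory_path(instance, filename):
    """Build a relative, Unix-style upload path from the instance and filename."""
    # owner_id is a hypothetical attribute standing in for whatever
    # identifies the user on your model.
    return posixpath.join("user_{0}".format(instance.owner_id), filename)


class FakeInstance:
    """Stand-in for a model instance, so the function can run without Django."""
    owner_id = 42


path = user_directory_path(FakeInstance(), "report.pdf")
```

The returned path is relative; the storage system resolves it against MEDIA_ROOT.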

Additionally, accessing model.my_file_field resolves to a FieldFile instance, which acts like a file. So you should be able to write an upload_to like this:

import os

def hash_upload(instance, filename):
    instance.my_file.open()  # make sure we're at the beginning of the file
    contents = instance.my_file.read()  # get the contents
    fname, ext = os.path.splitext(filename)
    # hash_function is a placeholder for the hash you choose
    return "{0}_{1}{2}".format(fname, hash_function(contents), ext)  # assemble the filename

Substitute whichever hash function you want to use. No extra save to disk is needed at all (in fact, the upload is often already sitting in temporary storage or, for smaller files, in memory).
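
One possible hash_function, as a minimal sketch: it assumes SHA-256 from the standard library is acceptable and that the contents arrive as bytes. Any other hashlib algorithm could be swapped in the same way.

```python
import hashlib


def hash_function(contents):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(contents).hexdigest()


# The digest is a 64-character hexadecimal string, safe to use in a filename.
example = hash_function(b"hello")
```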

You would use it like:

class MyModel(models.Model):
    my_file = models.FileField(upload_to=hash_upload,...)

I haven't tested this, so you might need to poke at the line that reads the entire file (and you may want to hash only the first chunk of the file, to prevent malicious users from uploading massive files and causing DoS attacks). You can get the first chunk with instance.my_file.read(instance.my_file.DEFAULT_CHUNK_SIZE).
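
The first-chunk idea can be sketched as follows. This is an untested-in-Django illustration: hash_first_chunk is a hypothetical helper, the 64 KiB size is an assumption chosen to match Django's default chunk size, and io.BytesIO stands in for the uploaded file.

```python
import hashlib
import io

FIRST_CHUNK_SIZE = 64 * 1024  # assumed to match Django's DEFAULT_CHUNK_SIZE


def hash_first_chunk(fileobj, size=FIRST_CHUNK_SIZE):
    """Hash at most `size` bytes from the start of a seekable binary file."""
    fileobj.seek(0)
    head = fileobj.read(size)
    # Rewind so whatever saves the file afterwards still sees all of it.
    fileobj.seek(0)
    return hashlib.sha256(head).hexdigest()


# A 100,000-byte upload: only the first 65,536 bytes are hashed.
upload = io.BytesIO(b"a" * 100_000)
name = hash_first_chunk(upload)
```

The trade-off is that two different files sharing the same first chunk get the same name, so this suits cases where bounded work matters more than uniqueness.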
