Saving a file efficiently using a hash in Django

I am working on a Django project. I want the user to be able to upload a file (via a form) and then have it saved locally under the user's path, with a custom filename: the file's hash. The only solution I can think of is to use the "upload_to" argument of FileField. That means (I think):

1) Write the file to disk

2) Calculate the hash

3) Return the path + hash as the filename

The problem is that there are two write operations, one when saving the file from memory to disk to compute the hash, and the other when saving the file to the specified location.

Is there a way to override FileField's save-to-disk method (or where can I find out what goes on behind the scenes) so that I can save the file under a temporary name and then rename it to its hash, instead of having it written twice?

Thanks.





2 answers


The upload_to parameter of FileField accepts a callable, and the string it returns is joined with your MEDIA_ROOT setting to get the final filename. From the documentation:

This may also be a callable, such as a function. It will be called to obtain the upload path, including the filename. This callable must accept two arguments and return a Unix-style path (with forward slashes) to be passed along to the storage system. The two arguments are:

  • instance: the instance of the model where the FileField is defined. More specifically, this is the particular instance where the current file is being attached. In most cases this object will not have been saved to the database yet, so if it uses the default AutoField it might not yet have a value for its primary-key field.
  • filename: the filename that was originally given to the file. This may or may not be taken into account when determining the final destination path.

Also, when accessed, model.my_file_field resolves to a FieldFile instance, which acts like a file. So you could write upload_to like this:

import hashlib
import os

def hash_upload(instance, filename):
    instance.my_file.open()  # make sure we're at the beginning of the file
    contents = instance.my_file.read()  # read the uploaded contents
    fname, ext = os.path.splitext(filename)
    digest = hashlib.sha256(contents).hexdigest()  # swap in whichever hash you prefer
    return "{0}_{1}{2}".format(fname, digest, ext)  # assemble the new filename




Substitute whatever hash function you want to use. Saving to disk is not required at all (in fact, the file is often already in temporary storage, or, for smaller files, held in memory).
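For reference, whether an upload ends up in memory or in a temporary file is governed by Django's upload-handler settings; a minimal sketch using Django's default values:

# settings.py (sketch; these are Django's defaults)
# Uploads smaller than this many bytes stay in memory; larger ones
# are streamed to a temporary file on disk.
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440  # 2.5 MB

FILE_UPLOAD_HANDLERS = [
    "django.core.files.uploadhandler.MemoryFileUploadHandler",
    "django.core.files.uploadhandler.TemporaryFileUploadHandler",
]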

You would use it like:

class MyModel(models.Model):
    my_file = models.FileField(upload_to=hash_upload,...)


I haven't tested this, so you might need to tweak the line that reads the entire file (you could instead hash just the first chunk of the file, to prevent malicious users from uploading massive files and causing a denial of service). You can get the first chunk with instance.my_file.read(instance.my_file.DEFAULT_CHUNK_SIZE); a sketch of that variant follows.
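A minimal sketch of that first-chunk variant (the function name is made up; it assumes the same imports as above):

def hash_upload_first_chunk(instance, filename):
    instance.my_file.open()  # rewind to the start of the upload
    first_chunk = instance.my_file.read(instance.my_file.DEFAULT_CHUNK_SIZE)  # read only the first chunk
    fname, ext = os.path.splitext(filename)
    digest = hashlib.sha256(first_chunk).hexdigest()  # hash of the first chunk only
    return "{0}_{1}{2}".format(fname, digest, ext)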





Updated answer for Django 1.10 and later:

  • Your instance.my_file_field is an instance of UploadedFile, not a file-like object
  • It cannot be opened or closed; it can only be read, possibly in chunks
  • Calling read() on it can consume all available physical memory


In the following example, the instance has a class method "get_image_basedir" because several models use the same function but need different base directories; I left it in because it is a common pattern. HASH_CHUNK_SIZE is a variable I set, chosen to optimize disk reads (i.e. to match the filesystem's block size, or a multiple of it).

import hashlib
import os.path

HASH_CHUNK_SIZE = 65536  # tune to your filesystem's block size (or a multiple of it)

def get_image_path(instance, filename):
    base = instance.get_image_basedir()
    parts = os.path.splitext(filename)
    ctx = hashlib.sha256()
    if instance.img.multiple_chunks():
        # large upload: hash it chunk by chunk rather than reading it all at once
        for data in instance.img.chunks(HASH_CHUNK_SIZE):
            ctx.update(data)
    else:
        ctx.update(instance.img.read())  # small upload: already in memory
    return os.path.join(base, ctx.hexdigest() + parts[1])

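For context, a hypothetical model showing how this fits together (the model name, field name, and directory are illustrative, not from the original answer):

from django.db import models

class Avatar(models.Model):
    img = models.ImageField(upload_to=get_image_path)  # get_image_path as defined above

    @classmethod
    def get_image_basedir(cls):
        # each model that shares get_image_path returns its own base directory
        return "avatars"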









