Saving a file efficiently using a hash in Django
I am working on a Django project. I want the user to be able to upload a file (via a form) and have it saved locally to a custom path under a custom filename: its hash. The only solution I can think of is to use the "upload_to" argument of the FileField. As I understand it, that means:
1) Write the file to disk
2) Calculate the hash
3) Return path + hash as the filename
The problem is that there are two write operations: one when saving the file from memory to disk to compute the hash, and another when saving the file to the specified location.
Is there a way to override FileField's save-to-disk method (or where can I find what's going on behind the scenes) so that I can save the file under a temporary name and then rename it to its hash, instead of having it saved twice?
Thanks.
The upload_to parameter of FileField accepts a callable, and the string it returns is joined to your MEDIA_ROOT setting to get the final filename (from the documentation):
This may also be a callable, such as a function, which will be called to obtain the upload path, including the filename. This callable must be able to accept two arguments and return a Unix-style path (with forward slashes) to be passed along to the storage system. The two arguments that will be passed are:
instance: An instance of the model where the FileField is defined. More specifically, this is the particular instance where the current file is being attached. In most cases, this object will not have been saved to the database yet, so if it uses the default AutoField, it might not yet have a value for its primary key field.
filename: The filename that was originally given to the file. This may or may not be taken into account when determining the final destination path.
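As a minimal illustration of that two-argument signature, here is a sketch of an upload_to callable. The name user_directory_path and the instance.user.id attribute are assumptions made for the example, not anything taken from the question's models. The stand-in object at the end exists only so the function can be exercised without a database.

```python
from types import SimpleNamespace


def user_directory_path(instance, filename):
    # Return a Unix-style path, relative to MEDIA_ROOT, built from the two
    # arguments Django passes in. `instance.user.id` is hypothetical.
    return "user_{0}/{1}".format(instance.user.id, filename)


# Stand-in for a model instance, used only to show the call shape.
fake_instance = SimpleNamespace(user=SimpleNamespace(id=7))
example_path = user_directory_path(fake_instance, "report.pdf")
```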
Also, when accessed, model.my_file_field resolves to an instance of FieldFile, which acts like a file. So you should be able to write upload_to like this:
import os

def hash_upload(instance, filename):
    instance.my_file.open()  # make sure we're at the beginning of the file
    contents = instance.my_file.read()  # get the contents
    fname, ext = os.path.splitext(filename)
    # hash_function is a placeholder for whichever hash function you choose
    return "{0}_{1}{2}".format(fname, hash_function(contents), ext)  # assemble the filename
Substitute whichever hash function you want to use. Saving to disk is not required at all (in fact, the file has often already been uploaded to temporary storage or, in the case of smaller files, is held in memory).
You would use it like this:
class MyModel(models.Model):
    my_file = models.FileField(upload_to=hash_upload, ...)
I haven't tested this, so you may need to drop the line that reads the entire file (and you may want to hash just the first chunk of the file, to prevent malicious users from uploading massive files and causing DoS attacks). You can get the first chunk with instance.my_file.read(instance.my_file.DEFAULT_CHUNK_SIZE).
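The first-chunk idea might look like the sketch below. It assumes SHA-256 from hashlib and a 64 KiB chunk size (treat that number as an assumption for the example). io.BytesIO stands in for the uploaded file, and first_chunk_digest is a made-up name. Keep in mind that two different files sharing the same first chunk would end up with the same hash.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # assumed chunk size for this sketch


def first_chunk_digest(fileobj, chunk_size=CHUNK_SIZE):
    # Hash at most `chunk_size` bytes from the start of the file, so the
    # cost stays bounded regardless of how large the upload is.
    fileobj.seek(0)
    data = fileobj.read(chunk_size)
    fileobj.seek(0)  # rewind so the file can still be saved afterwards
    return hashlib.sha256(data).hexdigest()


upload = io.BytesIO(b"a" * (CHUNK_SIZE + 100))  # stand-in for an uploaded file
digest = first_chunk_digest(upload)
```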
Updated answer, for Django 1.10 at least:
- Your instance.my_file_field is an instance of UploadedFile, not a file-like object
- It cannot be opened or closed, only read, possibly in chunks
- Calling read() can unconditionally consume all available physical memory
In the following example, the instance has a class method "get_image_basedir" because several models use the same function but need different base directories. I left it in since it is a common pattern. HASH_CHUNK_SIZE is a variable that I set, chosen to optimize disk reads (i.e., to match the file system's block size or a multiple of it).
def get_image_path(instance, filename):
    import os.path
    import hashlib
    base = instance.get_image_basedir()
    parts = os.path.splitext(filename)
    ctx = hashlib.sha256()
    if instance.img.multiple_chunks():
        # large upload: hash it piece by piece
        for data in instance.img.chunks(HASH_CHUNK_SIZE):
            ctx.update(data)
    else:
        ctx.update(instance.img.read())
    return os.path.join(base, ctx.hexdigest() + parts[1])
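To check that chunked hashing gives the same digest as hashing the whole content at once, the loop can be pulled into a small helper and run against a stand-in. FakeUpload below is hypothetical and only mimics a chunks() method. The hash_in_chunks name and the 4096 value for HASH_CHUNK_SIZE are likewise assumptions for the sketch.

```python
import hashlib

HASH_CHUNK_SIZE = 4096  # assumed value; pick one that suits your file system


class FakeUpload:
    """Hypothetical stand-in exposing chunks() like an uploaded file."""

    def __init__(self, content):
        self.content = content

    def chunks(self, chunk_size):
        # Yield the content in consecutive pieces of at most chunk_size bytes.
        for start in range(0, len(self.content), chunk_size):
            yield self.content[start:start + chunk_size]


def hash_in_chunks(upload, chunk_size=HASH_CHUNK_SIZE):
    # Feed the hash one chunk at a time so memory use stays bounded.
    ctx = hashlib.sha256()
    for data in upload.chunks(chunk_size):
        ctx.update(data)
    return ctx.hexdigest()


payload = b"x" * 10000
chunked_digest = hash_in_chunks(FakeUpload(payload))
whole_digest = hashlib.sha256(payload).hexdigest()
```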