Improve the performance of moving a large, growing number of files in a mounted folder

This is my situation:

I have a Windows share, which I mount with mount -t cifs -o username=username,password=password,rw,nounix,iocharset=utf8,file_mode=0777,dir_mode=0777 //192.168.1.120/storage /mnt/storage

This folder contains a very rapidly growing number of files of different sizes (from a few bytes to ~ 20 MB). If not moved / deleted, the number of files in this directory can exceed 10 million.

I need to move batches (in move_script) of files with a specific name (*.fext) from this directory to another directory (currently a subfolder of it, /mnt/storage/in_progress).

The script then runs another script (process_script) that processes the files in /mnt/storage/in_progress. When process_script has finished, move_script moves the files again, to another subdirectory (/mnt/storage/done). The move-and-process cycle continues until the original folder (/mnt/storage) contains no more files.

Additional information about the process:

  • the current bottleneck is moving the files (they are moved only slightly faster than new files are created in the directory):

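    import os
    from shutil import move

    # batch_size is defined elsewhere in move_script
    if len(os.listdir("/mnt/storage")) >= batch_size:
        i = 0  # number of files moved in this batch
        for f in os.listdir("/mnt/storage"):
            if f.endswith(".fext"):
                move("/mnt/storage/" + f, "/mnt/storage/in_progress")
                i += 1
            if i == batch_size:
                break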
    
          

  • move_script moves a batch, starts process_script, and waits for the processing to complete

  • processing the files in /mnt/storage/in_progress is faster with batches of 1k-2k files.

  • I tried increasing the number of files moved per batch: move 1k first, then, if the number of files in the source directory keeps growing, double the number of files moved. This slows down the processing in process_script, but helps to keep up with the "file generator".

  • I decided to simply rename the subdirectory /mnt/storage/in_progress to "/mnt/storage/done"+i_counter after process_script has finished, and to create a new /mnt/storage/in_progress. I am guessing this will halve the time spent moving files in the script.
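The directory-rename idea from the last bullet could look roughly like the sketch below. The function name rotate_in_progress and the temporary directory standing in for /mnt/storage are assumptions made for illustration; only the in_progress and done+counter naming comes from the description above.

```python
import os
import tempfile


def rotate_in_progress(base, counter):
    """Rename base/in_progress to base/done<counter>, then recreate in_progress."""
    in_progress = os.path.join(base, "in_progress")
    done = os.path.join(base, "done{}".format(counter))
    os.rename(in_progress, done)  # a single rename call for the whole batch
    os.mkdir(in_progress)         # fresh, empty directory for the next batch
    return done


# Demonstration on a temporary directory standing in for /mnt/storage.
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, "in_progress"))
for n in range(3):
    open(os.path.join(base, "in_progress", "file{}.fext".format(n)), "w").close()

done_dir = rotate_in_progress(base, 1)
print(sorted(os.listdir(done_dir)))
print(os.listdir(os.path.join(base, "in_progress")))
```

Whether this is as cheap on the CIFS mount as on a local disk would still have to be measured there.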

I need to speed up the process to keep up with the file generator. How can I improve the performance of this move operation?

I am open to any suggestion and am ready to completely change my current approach.

edit: the scripts run on Debian Wheezy, so I could theoretically use a subprocess that calls mv, but I don't know how sensible that would be.

===========================================

edit2: I wrote a script to check the speed difference between different ways of moving files. I first created 1 x 5 GB (dd if=/dev/urandom of=/mnt/storage/source/test.file bs=100M count=50), then 100 x 5 MB (for i in {1..100}; do dd if=/dev/urandom of=/mnt/storage/source/file$i bs=1M count=5; done) and finally 10000 x 5 kB (for i in {1..100000}; do dd if=/dev/urandom of=/mnt/storage/source/file$i bs=1k count=5; done).

from shutil import move, rmtree
from os import rename
from datetime import datetime
import subprocess
import os

print("Subprocess mv: for every file in directory..")
s = datetime.now()
for f in os.listdir("/mnt/storage/source/"):
    try:
        subprocess.call(["mv /mnt/storage/source/"+str(f)+" /mnt/storage/mv"],shell=True)
    except Exception as e:
        print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")

print("Subprocessmv : directory/*..")
s = datetime.now()
try:
    subprocess.call(["mv /mnt/storage/mv/* /mnt/storage/mvf"],shell=True)
except Exception as e:
    print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")


print("shutil.move: for every file file in directory..")
s = datetime.now()
for f in os.listdir("/mnt/storage/mvf/"):
    try:    
        move("/mnt/storage/mvf/"+str(f),"/mnt/storage/move")
    except Exception as e:
        print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")

print("os.rename: for every file in directory..")
s = datetime.now()
for f in os.listdir("/mnt/storage/move/"):
    try:
        rename("/mnt/storage/move/"+str(f),"/mnt/storage/rename/"+str(f))
    except Exception as e:
        print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")


if os.path.isdir("/mnt/storage/rename_new"):
    rmtree('/mnt/storage/rename_new')
print("os.rename & os.mkdir: rename source dir to destination & make new source dir..")
s = datetime.now()
rename("/mnt/storage/rename/","/mnt/storage/rename_new")
os.mkdir("/mnt/storage/rename/")
e = datetime.now()
print("took {}".format(e-s)+"\n")

      

This showed that the choice of method does not matter much. The 5 GB file was moved very quickly, which tells me that moving by changing the file table works. Here are the results for the 10000 * 5 kB files (the results seemed to depend on the current network workload: for example, the first mv test took 2m 28s, and a later run with the same files took 3m 22s; also, os.rename() was the fastest method most of the time):

Subprocess mv: for every file in directory..
took 0:02:47.665174

Subprocessmv : directory/*..
took 0:01:40.087872

shutil.move: for every file file in directory..
took 0:01:48.454184

os.rename: for every file in directory..
rename took 0:02:05.597933

os.rename & os.mkdir: rename source dir to destination & make new source dir..
took 0:00:00.005704
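To put these timings in perspective, the average cost per file can be worked out as in the rough sketch below. The timings are copied from the results above; FILE_COUNT is an assumption taken from the "10000 * 5KB" description, and the dictionary labels are made up for readability.

```python
from datetime import timedelta

# Assumption: the run contained 10000 files, as stated in the text.
FILE_COUNT = 10000

# Elapsed times copied from the per-file benchmark results.
timings = {
    "subprocess mv, per file": timedelta(minutes=2, seconds=47.665174),
    "subprocess mv, directory/*": timedelta(minutes=1, seconds=40.087872),
    "shutil.move, per file": timedelta(minutes=1, seconds=48.454184),
    "os.rename, per file": timedelta(minutes=2, seconds=5.597933),
}


def ms_per_file(elapsed, count=FILE_COUNT):
    """Average cost of moving one file, in milliseconds."""
    return elapsed.total_seconds() * 1000.0 / count


per_file = {name: ms_per_file(t) for name, t in timings.items()}
for name, ms in sorted(per_file.items(), key=lambda kv: kv[1]):
    print("{:<28} {:6.2f} ms/file".format(name, ms))
```

Under that assumption every method lands at roughly 10 to 17 ms per file, while the directory rename costs a few milliseconds in total.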

      

+3


3 answers


You can simplify your code by using the glob module to list the files. Most likely, though, the limiting factor is the network. Chances are the files are being copied over the network rather than just being moved; otherwise this process would be very fast.
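A minimal sketch of the glob suggestion, assuming a helper called move_batch (a hypothetical name) and temporary directories in place of /mnt/storage and /mnt/storage/in_progress:

```python
import glob
import os
import shutil
import tempfile


def move_batch(src_dir, dst_dir, pattern="*.fext", batch_size=1000):
    """Move up to batch_size files matching pattern from src_dir to dst_dir."""
    moved = 0
    # iglob returns an iterator over the paths that match the pattern.
    for path in glob.iglob(os.path.join(src_dir, pattern)):
        if moved >= batch_size:
            break
        shutil.move(path, os.path.join(dst_dir, os.path.basename(path)))
        moved += 1
    return moved


# Demonstration: 5 matching files, 2 non-matching files, batch size 3.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
for n in range(5):
    open(os.path.join(src, "file{}.fext".format(n)), "w").close()
for n in range(2):
    open(os.path.join(src, "other{}.txt".format(n)), "w").close()

moved = move_batch(src, dst, batch_size=3)
print(moved, len(os.listdir(src)), len(os.listdir(dst)))
```

This only tidies up the filtering; it does not change how each individual move is carried out.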



Try using os.rename() to move the files. It may not work on a CIFS filesystem, but it is worth a try. This should do an actual rename, not a copy. If that doesn't work, you may need to mount the filesystem in a different way, or run the move process on the machine that hosts the filesystem.
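One way to check whether a plain rename is possible is sketched below. Unlike shutil.move, os.rename does not fall back to copying: when source and destination are on different filesystems it raises OSError with errno.EXDEV. The helper name rename_without_copy and the temporary paths are assumptions for illustration.

```python
import errno
import os
import tempfile


def rename_without_copy(src, dst):
    """Rename src to dst; return False if the paths are on different filesystems."""
    try:
        os.rename(src, dst)
        return True
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            # A rename is impossible here; moving would require copy + delete.
            return False
        raise


# Demonstration inside one temporary directory, so the rename succeeds.
work = tempfile.mkdtemp()
os.mkdir(os.path.join(work, "in_progress"))
src_path = os.path.join(work, "sample.fext")
dst_path = os.path.join(work, "in_progress", "sample.fext")
open(src_path, "w").close()

ok = rename_without_copy(src_path, dst_path)
print(ok)
```

If this returns False for paths on the share, the files are being copied rather than renamed.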

+2



Peel the onion by logging into the CIFS server and checking if it can even move files quickly without copying them.

If you find that a copy is still taking place, check the mounts on the CIFS server. It is possible that, behind the scenes, /mnt/storage/in_progress and/or /mnt/storage/done are in fact different file systems or hard drives mounted underneath /mnt/storage but shared through a single CIFS share.



EDIT: The benchmark that Daedalus Mythos has since added shows that this is unlikely, since Daedalus can quickly move a large 5 GB file but cannot quickly move thousands of smaller files.

+1



This is a great opportunity to ask the developer of the file generator to put the *.fext files in their own subdirectory, for example /mnt/storage/fext_raw, so you can simply rename the entire directory to /mnt/storage/in_progress and then recreate /mnt/storage/fext_raw.

+1

