How can I optimize the performance of an image comparison script?

I wrote a script that compares a large set of images (over 4500 files) against each other using a root-mean-square (RMS) comparison of their histograms. It first resizes each image to 800x600 and takes a histogram. It then builds a list of all pairwise combinations and splits it evenly across four threads, each of which calculates the RMS difference for its share of the pairs. Pairs with an RMS below 500 are moved to folders, to be sorted manually later.

#!/usr/bin/python3

import sys
import os
import math
import operator
import functools
import datetime
import threading
import queue
import itertools
from PIL import Image


def calc_rms(hist1, hist2):
    return math.sqrt(
        functools.reduce(operator.add, map(
            lambda a, b: (a - b) ** 2, hist1, hist2
        )) / len(hist1)
    )


def make_histogram(imgs, path, qout):
    for img in imgs:
        try:
            tmp = Image.open(os.path.join(path, img))
            tmp = tmp.resize((800, 600), Image.ANTIALIAS)
            qout.put([img, tmp.histogram()])
        except Exception:
            print('bad image: ' + img)
    return


def compare_hist(pairs, path):
    for pair in pairs:
        rms = calc_rms(pair[0][1], pair[1][1])
        if rms < 500:
            folder = 'maybe duplicates'
            if rms == 0:
                folder = 'exact duplicates'
            try:
                os.rename(os.path.join(path, pair[0][0]), os.path.join(path, folder, pair[0][0]))
            except Exception:
                pass
            try:
                os.rename(os.path.join(path, pair[1][0]), os.path.join(path, folder, pair[1][0]))
            except Exception:
                pass
    return


def get_time():
    return datetime.datetime.now().strftime("%H:%M:%S")


def chunkify(lst, n):
    return [lst[i::n] for i in range(n)]


def main(path):
    starttime = get_time()
    qout = queue.Queue()
    images = []
    for img in os.listdir(path):
        if os.path.isfile(os.path.join(path, img)):
            images.append(img)
    imglen = len(images)
    print('Resizing ' + str(imglen) + ' Images ' + starttime)
    images = chunkify(images, 4)
    threads = []
    for x in range(4):
        threads.append(threading.Thread(target=make_histogram, args=(images[x], path, qout)))

    [x.start() for x in threads]
    [x.join() for x in threads]

    resizetime = get_time()
    print('Done resizing ' + resizetime)

    histlist = []
    for i in qout.queue:
        histlist.append(i)

    if not os.path.exists(os.path.join(path, 'exact duplicates')):
        os.makedirs(os.path.join(path, 'exact duplicates'))
    if not os.path.exists(os.path.join(path, 'maybe duplicates')):
        os.makedirs(os.path.join(path, 'maybe duplicates'))

    combinations = []
    for img1, img2 in itertools.combinations(histlist, 2):
        combinations.append([img1, img2])

    combicount = len(combinations)
    print('Going through ' + str(combicount) + ' combinations of ' + str(imglen) + ' Images. Please stand by')
    combinations = chunkify(combinations, 4)

    threads = []

    for x in range(4):
        threads.append(threading.Thread(target=compare_hist, args=(combinations[x], path)))

    [x.start() for x in threads]
    [x.join() for x in threads]

    print('\nstarted at ' + starttime)
    print('resizing done at ' + resizetime)
    print('went through ' + str(combicount) + ' combinations of ' + str(imglen) + ' Images')
    print('all done at ' + get_time())

if __name__ == '__main__':
    main(sys.argv[1]) # sys.argv[1] has to be a folder of images to compare


This works, but while the resizing finishes within 15-20 minutes, the comparison takes several hours. At first I assumed the bottleneck was the blocking queue the workers pulled their combinations from, so I replaced it with predefined slices of the array. That did not reduce the execution time. I also ran it without moving any files, to rule out a hard drive bottleneck.

Profiling using cProfile provides the following output.

Resizing 4566 Images 23:51:05
Done resizing 00:05:07
Going through 10421895 combinations of 4566 Images. Please stand by

started at 23:51:05
resizing done at 00:05:07
went through 10421895 combinations of 4566 Images
all done at 03:09:41
         10584539 function calls (10584414 primitive calls) in 11918.945 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     16/1    0.001    0.000 11918.945 11918.945 {built-in method exec}
        1    2.962    2.962 11918.945 11918.945 imcomp.py:3(<module>)
        1   19.530   19.530 11915.876 11915.876 imcomp.py:60(main)
       51 11892.690  233.190 11892.690  233.190 {method 'acquire' of '_thread.lock' objects}
        8    0.000    0.000 11892.507 1486.563 threading.py:1028(join)
        8    0.000    0.000 11892.507 1486.563 threading.py:1066(_wait_for_tstate_lock)
        1    0.000    0.000 11051.467 11051.467 imcomp.py:105(<listcomp>)
        1    0.000    0.000  841.040  841.040 imcomp.py:76(<listcomp>)
 10431210    1.808    0.000    1.808    0.000 {method 'append' of 'list' objects}
     4667    1.382    0.000    1.382    0.000 {built-in method stat}


The complete output of the profiler can be found here.

Given the fourth line of the profile, I am assuming that the threads are blocking somehow. But why, and why exactly 51 times, regardless of the number of images?

I am running this on Windows 7 64-bit.

Thanks in advance.

2 answers


One of the main problems is that you are using threads to do work that is at least partially CPU-bound. Because of the Global Interpreter Lock, only one CPython thread can execute Python bytecode at a time, so you cannot take advantage of multiple processor cores. For CPU-bound tasks this makes multi-threaded performance at best no better than single-threaded, and probably worse because of the extra overhead the threads add. This is noted in the threading documentation:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

To get around the GIL limitation, do as the docs say and use the multiprocessing library instead of threading:



import multiprocessing
...

qout = multiprocessing.Queue()

for x in range(4):
    threads.append(multiprocessing.Process(target=make_histogram, args=(images[x], path, qout)))

...
for x in range(4):
    threads.append(multiprocessing.Process(target=compare_hist, args=(combinations[x], path)))

      

As you can see, multiprocessing is for the most part a drop-in substitute for threading, so the changes shouldn't be too complicated. The only complication would be if any of the arguments you pass between processes were not picklable, though I think they all are in your case. There is also an increased IPC cost for shipping Python data structures between processes, but I suspect the benefit of truly parallel computation will outweigh that extra cost.

All that said, you may be somewhat I/O-bound here because you rely on disk reads and writes. Parallelization won't make disk I/O any faster, so there isn't much to be gained there.
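For what it's worth, here is a minimal sketch of the same pipeline on top of multiprocessing.Pool, which also takes care of starting and joining the worker processes. It reuses the compare_hist from your script; the histogram_one helper and the way the chunks are handed to starmap are my own assumptions, not code from the question:

import functools
import itertools
import multiprocessing
import os

from PIL import Image


def histogram_one(path, name):
    # Hypothetical helper: resize one image and return (name, histogram),
    # or None if the file cannot be opened as an image.
    try:
        img = Image.open(os.path.join(path, name)).resize((800, 600), Image.ANTIALIAS)
        return (name, img.histogram())
    except Exception:
        return None


def run(path, images, workers=4):
    with multiprocessing.Pool(workers) as pool:
        # Histograms are computed in separate worker processes, so all cores are used.
        hists = pool.map(functools.partial(histogram_one, path), images)
        hists = [h for h in hists if h is not None]

        pairs = list(itertools.combinations(hists, 2))
        chunks = [pairs[i::workers] for i in range(workers)]
        # compare_hist keeps its (pairs, path) signature; only picklable data
        # (file names and histogram lists) crosses the process boundary.
        pool.starmap(compare_hist, [(chunk, path) for chunk in chunks])

On Windows the multiprocessing entry point has to stay behind the if __name__ == '__main__': guard, which your script already has.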



With 4500 images to compare, I would suggest file-level multiprocessing rather than (necessarily) intra-image multithreading. As @dano pointed out, the GIL will get in the way otherwise. My strategy:

  • one worker process per core (or a configured number);
  • one orchestration process that supervises the workers and does some IPC to hand jobs out to them (see the sketch below).
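A rough sketch of that worker/orchestrator split, purely as an illustration; compare_pair is a hypothetical stand-in for whatever per-pair comparison you end up with:

import multiprocessing


def worker(jobs, results):
    # Pull (name_a, hist_a, name_b, hist_b) jobs until a None sentinel arrives.
    for name_a, hist_a, name_b, hist_b in iter(jobs.get, None):
        results.put((name_a, name_b, compare_pair(hist_a, hist_b)))


def orchestrate(pairs, workers=4):
    jobs = multiprocessing.Queue()
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(jobs, results))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for (name_a, hist_a), (name_b, hist_b) in pairs:
        jobs.put((name_a, hist_a, name_b, hist_b))
    for _ in procs:
        jobs.put(None)  # one sentinel per worker so every process exits
    scores = [results.get() for _ in range(len(pairs))]
    for p in procs:
        p.join()
    return scores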

Looking (briefly) at your code, it also looks like it would benefit from some laziness; I can't see anything that tries to short-circuit. For example, if you compare the images chunk by chunk, you can stop comparing as soon as the chunks seen so far show that the two images are clearly different. You could also vary the order in which you walk through the chunks and the size/shape of the chunks.
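A hedged illustration of that early exit, assuming the same flat histogram lists your script already builds (the threshold and chunk size are made up):

def rms_below(hist1, hist2, threshold=500, chunk=64):
    # rms < threshold  <=>  sum of squared differences < threshold**2 * len(hist),
    # so accumulate the squared differences slice by slice and stop as soon as
    # the running total already rules the pair out.
    limit = threshold * threshold * len(hist1)
    total = 0
    for start in range(0, len(hist1), chunk):
        total += sum((a - b) ** 2
                     for a, b in zip(hist1[start:start + chunk],
                                     hist2[start:start + chunk]))
        if total >= limit:
            return False
    return True

Note that this also drops the square root entirely, since the comparison happens on the squared values.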



Also, I would look at cheaper arithmetic that avoids taking so many square roots; comparing the sum of squared differences against the squared threshold, as in the sketch above, removes the square root entirely, or an "approximate" square root via a lookup table might be good enough.

If I am not mistaken, you only need to keep the intermediate form (the histogram) around temporarily; there is no need to hold on to the resized 800x600 image.

It would also be helpful to know exactly what you mean by "equal" in the context of this exercise.
