Checksum of image file as an optimization for unique-content comparison

Users upload photos to our PHP-based system. We mark some of them as prohibited due to inappropriate content. I am looking for an optimization of the auto-compare step that rejects these flagged photographs: each new upload must be compared against many prohibited files.

Possible solutions:

1 / Store the prohibited files and compare full file contents - works well, but is slow.

2 / Store a checksum of each image file and compare checksums - an idea to improve speed.

3 / Some smart algorithm that is fast enough to compare photos for similarity - but I have no idea how to do that in PHP.

What's the best solution?

+2




3 answers


Don't calculate checksums, calculate hashes!

I once created a simple application that had to look for duplicate images on my hard drive. It only searched for .JPG files, but for each file it computed a hash over the first 1024 bytes and then appended the width, height and file size of the image to get a key string like "875234:640:480:13286". As it turned out, I never saw a false duplicate with this algorithm, although false duplicates remain possible. Note, however, that this scheme will miss a duplicate when someone simply appends a single byte to the file or makes a very small adjustment to the image.
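A minimal PHP sketch of that key scheme, assuming GD-readable images; the function name imageKey is illustrative, and crc32 stands in for whatever hash the original Delphi tool used:

```php
<?php
// Build a duplicate-detection key: hash of the first 1024 bytes,
// plus image width, height and file size.
function imageKey(string $path): string
{
    // Hash only the first 1024 bytes of the file.
    $head = file_get_contents($path, false, null, 0, 1024);
    $hash = crc32($head);

    // Append dimensions and file size to reduce collisions further.
    [$width, $height] = getimagesize($path);
    $size = filesize($path);

    return "{$hash}:{$width}:{$height}:{$size}";
}

// Two files are treated as exact duplicates when their keys match:
// imageKey('a.jpg') === imageKey('b.jpg')
```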



Another trick might be to reduce the size and the number of colors of each image. If you resize every image to 128x128 pixels and reduce the number of colors to 16 (4 bits per pixel), you get a reasonably unique pattern of 8192 bytes per image. Compute a hash over this pattern and use that hash as the primary key. When you get a hit, it may still be a false positive, so you then compare the stored pattern with the pattern of the new image byte by byte. This pattern comparison can also be used when the hash lookup says the new image is unique, to catch near-duplicates. I would still have to build my own tool for this, but it is basically a form of image fingerprinting followed by a comparison of the fingerprints.
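A hedged PHP/GD sketch of that fingerprint idea (the answer gives no code; fingerprint is an illustrative name, and since GD does not guarantee a stable palette ordering across images, this only approximates the scheme):

```php
<?php
// Reduce each image to a 128x128, 16-color pattern and hash it.
function fingerprint(string $path): string
{
    $img   = imagecreatefromstring(file_get_contents($path));
    $thumb = imagescale($img, 128, 128);        // normalize dimensions
    imagetruecolortopalette($thumb, false, 16); // reduce to 16 colors

    // One palette index per pixel forms the raw pattern.
    $pattern = '';
    for ($y = 0; $y < 128; $y++) {
        for ($x = 0; $x < 128; $x++) {
            $pattern .= chr(imagecolorat($thumb, $x, $y));
        }
    }
    imagedestroy($img);
    imagedestroy($thumb);

    // The hash of the pattern is the primary key; store $pattern too
    // so hash hits can be verified (and near-duplicates compared).
    return md5($pattern);
}
```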

My first solution will find exact matches. My second solution will find similar images. (By the way, I wrote my hash method in Delphi, but technically, any hash method would be good enough.)

+4




Comparing image similarity is not a trivial problem, so unless you really want to put a lot of effort into image-comparison algorithms, your idea of generating some sort of hash of the image data and comparing hashes will at least let you find exact duplicates quickly. I would go with your current plan, but make sure you use a decent (but fast) hash so the chance of collisions is low.
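As a minimal sketch of that plan in PHP (the $prohibited lookup table keyed by hash is an assumed illustration, not from the answer):

```php
<?php
// Whole-file hash as a fast duplicate key. md5 is not collision-proof
// in an adversarial sense, but for ordinary uploads the odds of an
// accidental collision are negligible, and it is fast.
$key = md5_file($uploadPath); // or hash_file('sha1', $uploadPath)

if (isset($prohibited[$key])) {
    // Exact byte-for-byte duplicate of a flagged file: reject it.
}
```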



+2




The problem with hashes, as suggested above, is that if someone changes even a single pixel, the hash comes out completely different.

There are some good frameworks that can compare the contents of two files and return, as a percentage, how similar they look. I once ran into one such command-line application, built in a scientific environment and open source, but I can't remember its name.

Such a framework could definitely help you, since these tools can be extremely fast even with a lot of files.
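For a concrete flavor of what "percentage similarity" can mean, here is a hedged PHP/GD sketch of an average hash (aHash), a common perceptual-hashing technique; this is not the unnamed tool from the answer:

```php
<?php
// 64-bit average hash: shrink to 8x8 gray pixels, then one bit per
// pixel depending on whether it is brighter than the average.
function averageHash(string $path): array
{
    $img = imagescale(imagecreatefromstring(file_get_contents($path)), 8, 8);
    imagefilter($img, IMG_FILTER_GRAYSCALE);

    $gray = [];
    for ($y = 0; $y < 8; $y++) {
        for ($x = 0; $x < 8; $x++) {
            $gray[] = imagecolorat($img, $x, $y) & 0xFF; // R=G=B after grayscale
        }
    }
    imagedestroy($img);

    $avg = array_sum($gray) / 64;
    return array_map(fn ($g) => $g > $avg ? 1 : 0, $gray);
}

// Similarity as the percentage of matching bits; a one-pixel edit in
// the source image flips at most a few bits instead of the whole hash.
function similarity(array $a, array $b): float
{
    $same = count(array_filter(array_map(fn ($x, $y) => $x === $y, $a, $b)));
    return 100.0 * $same / 64;
}
```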

+1








