Compare images and remove duplicates
I have two folders of images, all PNG. One folder is a copy of the other, with some images modified and some added. The file names are the same, but the image contents may differ. Unfortunately, other attributes such as timestamps are completely random.
I want to remove the duplicates (by content) from the new folder and keep only the updated and new images.
I have installed ImageMagick to use the compare command, but I cannot figure it out. :-( Can you help me? Thanks in advance!
Added: I am on Mac OS X.
You don't say whether you are on OSX/Linux or Windows, but I can get you started. ImageMagick can compute a hash (checksum) of all the pixel data in an image, regardless of date or timestamp, like this:
identify -format "%# %f\n" *.png
25a3591a58550edd2cff65081eab11a86a6a62e006431c8c4393db8d71a1dfe4 blue.png
304c0994c751e75eac86bedac544f716560be5c359786f7a5c3cd6cb8d2294df green.png
466f1bac727ac8090ba2a9a13df8bfb6ada3c4eb3349087ce5dc5d14040514b5 grey.png
042a7ebd78e53a89c0afabfe569a9930c6412577fcf3bcfbce7bafe683e93e8a hue.png
d819bfdc58ac7c48d154924e445188f0ac5a0536cd989bdf079deca86abb12a0 lightness.png
b63ad69a056033a300f23c31f9425df6f469e79c2b9f3a5c515db3b52c323a65 montage.png
a42a5f0abac3bd2f6b4cbfde864342401847a120dacae63294edb45b38edd34e red.png
10bf63fd725c5e02c56df54f503d0544f14f754d852549098d5babd8d3daeb84 sample.png
e95042f227d2d7b2b3edd4c7eec05bbf765a09484563c5ff18bc8e8aa32c1a8e sat.png
So, if you do this in each folder, you will have checksums of all files with their names next to them in a separate file for each folder.
If you then combine the two files and sort them, you can easily find duplicates as the duplicated files will appear next to each other.
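As a rough sketch of that combine-and-sort step, assume the two listings were saved to the hypothetical files a.txt and b.txt in the "hash filename" format shown above. The helper name adjacent_dupes is made up for illustration; it prints every group of lines that share a hash once the combined listing is sorted.

```shell
# Hypothetical helper: print all lines of the combined, sorted listing
# whose hash (first field) also appears on a neighbouring line.
adjacent_dupes() {
    sort "$1" "$2" | awk '
        NR > 1 && $1 == prev {          # same hash as the previous line
            if (!shown) print prevline  # print the first member of the group once
            print
            shown = 1
        }
        $1 != prev { shown = 0 }        # a new hash starts a new group
        { prev = $1; prevline = $0 }'
}

# Example call (file names are assumptions):
#   adjacent_dupes a.txt b.txt
```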
Let's say you run the above command in two folders, dira and dirb, like this:
cd dira
identify -format "%# %f\n" *.png > $HOME/dira
cd ../dirb
identify -format "%# %f\n" *.png > $HOME/dirb
Then you can do something like this in awk
awk 'FNR==NR{name[$1]=$2;next}
{
if($1 in name){print $2 " duplicates " name[$1]}
}' $HOME/dir*
So the $HOME/dir* part passes both files to awk. The {} block after FNR==NR applies only to the first file read; as we read it, we fill the associative array name[], indexed by hash, with the filenames. Then, while reading the second file, we check whether each hash has already been seen. If it has, we report a duplicate and print the name from the second file ($2) along with the name stored in name[] during the first pass.
This won't work with filenames that contain spaces, so if that's a problem, change the identify command to put a colon between the hash and the filename, like this:
identify -format "%#:%f\n" *.png
and change awk to awk -F":" and it should work again.
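For the situation in the question, where file names are the same in both folders, one possible variation is to treat a file as unchanged only when both its hash and its name match. The sketch below is my reading of the question rather than the method above: it assumes both listings were produced with the colon format ("%#:%f\n"), and the function name unchanged_files is hypothetical.

```shell
# Hypothetical helper: print the names of files in the NEW listing ($2)
# whose "hash:name" line also appears verbatim in the OLD listing ($1).
unchanged_files() {
    awk 'FNR == NR { seen[$0] = 1; next }   # first file: remember every line
         ($0 in seen) {                     # second file: identical line seen before
             # everything after the first colon is the file name
             print substr($0, index($0, ":") + 1)
         }' "$1" "$2"
}

# Dry run (prints the rm commands instead of running them; paths are assumptions):
#   unchanged_files "$HOME/dira" "$HOME/dirb" |
#       while IFS= read -r f; do echo rm -- "dirb/$f"; done
```

Matching on the whole line keeps file names with spaces intact, and the dry run lets you review the list before deleting anything.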
Here is my ugly PowerShell solution (PowerShell is now cross-platform). I wrote it for one-time use, but it should work. I tried to comment it a bit to compensate for how bad it is.
That said, I'd back up your images first, just in case.
The catch here is that it only detects whether each file is a duplicate of the previous one. If you need to check whether each file is a duplicate of any other, you will want to nest another for() loop inside it, which should be simple enough.
#get the list of files with imagemagick
#powershell handily populates $files as an array, split by line
#this will take a bit
$files = identify -format "%# %f\n" *.png
$arr = @()
foreach($line in $files) {
#add 2 keys to the new array per line (hash and then filename)
$arr += @($line.Split(" "))
}
#for every 2 keys (eg each hash)
for($i = 2; $i -lt $arr.Length; $i += 2) {
#compare it to the last hash
if($arr[$i] -eq $arr[$i-2]) {
#print a helpful message and then delete
echo "$($arr[$i].Substring(0,16)) = $($arr[$i-2].Substring(0,16)) (removing $($arr[$i+1]))"
remove-item ($arr[$i+1])
}
}
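For comparison, the "duplicate of any other file" check mentioned above can be done in one pass without a nested loop by remembering the first file seen for each hash. This is a shell/awk sketch rather than PowerShell; it assumes a listing in the same "hash filename" format, and the function name later_dupes is made up.

```shell
# Hypothetical helper: for each line of a "hash filename" listing, report
# the file if an earlier line already had the same hash.
later_dupes() {
    awk '{
             hash = $1
             # the name is everything after the hash and one separating space
             name = substr($0, length($1) + 2)
         }
         (hash in first) { print name " duplicates " first[hash]; next }
         { first[hash] = name }' "$1"
}

# Example call (listing.txt is an assumed file name):
#   identify -format "%# %f\n" *.png > listing.txt
#   later_dupes listing.txt
```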
Bonus: to remove any images with a specific hash (all black 640x480 png in my case):
for($i = 0; $i -lt $arr.Length; $i += 2) {
if($arr[$i] -eq "f824c1a8a1128713f17dd8d1190d70e6012b509606d986e7a6c81e40b628df2b") {
echo "$($arr[$i+1])"
remove-item ($arr[$i+1])
}
}
Double bonus: C code that checks whether a newly written image collides with a hash already recorded in the folder hash/ and deletes the image if so. It was written for Windows / MinGW, but shouldn't be too hard to port if needed. Might be overkill, but I figured I'd throw it in there in case it is useful to anyone.
char filename[256] = "output/UNINITIALIZED.ppm";
unsigned long int timeint = time(NULL);
sprintf(filename, "../output/image%lu.ppm", timeint);
if(
writeppm(
filename,
SCREEN_WIDTH,
SCREEN_HEIGHT,
screenSurface->pixels
) != 0
) {
printf("image write error!\n");
return;
}
char shacmd[256];
sprintf(shacmd, "sha256sum %s", filename);
FILE *file = popen(shacmd, "r");
if(file == NULL) {
printf("failed to get image hash!\n");
return;
}
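//the hash is 64 characters but we need a 0 at the end too
char sha[96];
int i;
int c; //int, not char, so EOF can be detected reliably
//get hash until the first space (0x20), newline or end of output
for(i = 0; i < 64; i++) {
c = fgetc(file);
if(c == EOF || c == ' ' || c == '\n') break;
sha[i] = (char)c;
}
//terminate the string so it can be used with %s below
sha[i] = 0;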
pclose(file);
char hashfilename[256];
sprintf(hashfilename, "../output/hash/%s", sha);
if(_access(hashfilename, 0) != -1) {
//file exists, delete img
if(unlink(filename) != 0) {
printf("image delete error!\n");
}
} else {
FILE *hashfile = fopen(hashfilename, "w");
if(hashfile == NULL)
printf("hash file write error!\nfilename: %s\n", hashfilename);
else
fclose(hashfile); //only close the file if it was actually opened
}
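The same marker-file idea can be sketched in shell, which may be easier to adapt on OS X or Linux. This is only an illustration under stated assumptions: the function name dedupe_by_marker and its arguments are hypothetical, and like the C code it hashes the file's bytes (not the decoded pixel data), using sha256sum where available and shasum -a 256 otherwise.

```shell
# Hypothetical helper: delete FILE if a marker named after its SHA-256
# already exists in HASHDIR; otherwise create the marker and keep FILE.
dedupe_by_marker() {
    file=$1
    hashdir=$2
    mkdir -p "$hashdir" || return 2
    if command -v sha256sum >/dev/null 2>&1; then
        sum=$(sha256sum "$file" | cut -d' ' -f1)
    else
        sum=$(shasum -a 256 "$file" | cut -d' ' -f1)   # typical on OS X
    fi
    [ -n "$sum" ] || return 2        # hashing failed, leave the file alone
    if [ -e "$hashdir/$sum" ]; then
        rm -- "$file"                # content seen before: remove the copy
    else
        : > "$hashdir/$sum"          # first time: record an empty marker
    fi
}

# Example call (paths are assumptions):
#   dedupe_by_marker ../output/image123.ppm ../output/hash
```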