Creating a list of duplicate filenames with Perl
I'm trying to write a script to preprocess some long lists of files, but I'm not yet competent with Perl and am not getting the results I want.
Below is the script. I am stuck on the check for duplicates and would be grateful if someone could tell me where I am going wrong. The block dealing with duplicates appears to have the same shape as the examples I found, but it doesn't produce any output.
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', $ARGV[0] or die "can't open: $!";

foreach my $line (<$fh>) {
    # Trim list to remove directories which do not need to be checked
    next if $line =~ m/Inventory/;
    # MORE TO DO
    next if $line =~ m/Scanned photos/;
    $line =~ s/\n//;    # just for a tidy list when testing
    my @split = split( /\/([^\/]+)$/, $line );    # separate filename from rest of path
    foreach (@split) {
        push( my @filenames, "$_" );
        # print "@filenames\n";    # check content of array
        my %dupes;
        foreach my $item (@filenames) {
            next unless $dupes{$item}++;
            print "$item\n";
        }
    }
}
I am trying to figure out what is wrong with my check for duplicates. I know the array contains duplicates (uncommenting the first print statement gives me a list with a lot of duplicates), but the code as it stands prints nothing.
This isn't the main purpose of my post, but my ultimate goal is to remove unique filenames from the list and keep only filenames that appear in more than one directory. None of these files are identical, but there are many different versions of the same file, so I am matching on the filename alone.
For example, I need an input:
~/Pictures/2010/12345678.jpg
~/Pictures/2010/12341234.jpg
~/Desktop/temperature/12345678.jpg
to get the output:
~/Pictures/2010/12345678.jpg
~/Desktop/temperature/12345678.jpg
So I suppose it would ideally be better to check for duplicates with a regex match, without splitting, if that is possible.
The loop below does nothing useful, because the hash and the array only ever contain one value on each iteration of the loop:
foreach (@split) {
    push( my @filenames, "$_" );    # add one element to a fresh lexical array
    my %dupes;
    foreach my $item (@filenames) { # loops exactly once
        next unless $dupes{$item}++;    # add one key to a fresh lexical hash
        print "$item\n";
    }
}    # @filenames and %dupes go out of scope here
A lexical variable (declared with my) has a scope that extends to the end of the enclosing block { ... }, in this case your foreach loop. When the variables go out of scope, they are destroyed and all their data is lost.
I don't know why you are copying the filenames from @split to @filenames; it seems redundant. For deduplication, this would be:
my %seen;
my @uniq;
@uniq = grep !$seen{$_}++, @split;
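A minimal, runnable sketch of that idiom, with made-up sample data in place of your @split array:

```perl
use strict;
use warnings;

# stand-in for the filenames your split produces
my @split = qw(12345678.jpg 12341234.jpg 12345678.jpg);

# keep only the first occurrence of each name;
# %seen counts occurrences as a side effect
my %seen;
my @uniq = grep { !$seen{$_}++ } @split;

print "$_\n" for @uniq;    # prints 12345678.jpg then 12341234.jpg
```

The grep keeps an element only when its counter in %seen is still zero, so order of first appearance is preserved.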
Additional Information:
You may also be interested in using File::Basename to get the filename:
use File::Basename;
my $fullpath = "~/Pictures/2010/12345678.jpg";
my $name = basename($fullpath); # 12345678.jpg
Your substitution
$line =~ s/\n//;
should probably be
chomp($line);
When you read from a file handle with for (or foreach), you read all the lines at once and store them in memory. In most cases it is preferable to read line by line with while, for example:
while (my $line = <$fh>)
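Putting these pieces together, a sketch of how the whole script might look; the skip patterns come from your script, and the __DATA__ section stands in for your input file so the example is self-contained:

```perl
use strict;
use warnings;
use File::Basename;

# count how many times each bare filename occurs
my %count;
while (my $line = <DATA>) {
    chomp $line;
    next if $line =~ /Inventory/;
    next if $line =~ /Scanned photos/;
    $count{ basename($line) }++;
}

# print only names that appear in more than one path
for my $name (sort keys %count) {
    print "$name\n" if $count{$name} > 1;
}

__DATA__
~/Pictures/2010/12345678.jpg
~/Pictures/2010/12341234.jpg
~/Desktop/temperature/12345678.jpg
```

With the sample paths above this prints 12345678.jpg, which matches the desired behaviour of keeping only duplicated filenames.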
TLP's answer provides a lot of good advice. In addition:
Why use both an array and a hash to store the filenames? Just use the hash as the only storage and you remove duplicates automatically, i.e.:
my %filenames;    # outside of the loops
...
foreach (@split) {
    $filenames{$_}++;
}
Now, when you want the list of unique filenames, just use keys %filenames, or sort keys %filenames if you want them in alphabetical order. And the value of each hash key is its number of occurrences, so you can find out which names were duplicated, if you care.