Creating a list of duplicate filenames with Perl
I'm trying to write a script to preprocess some long lists of files, but I'm not yet competent with Perl and am not getting the results I want.
Below is the script. I am stuck on the check for duplicates and would be grateful if someone could tell me where I am going wrong. The block dealing with duplicates appears to have the same shape as the examples I found, but it doesn't produce any output.
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', $ARGV[0] or die "can't open: $!";

foreach my $line (<$fh>) {
    # Trim list to remove directories which do not need to be checked
    next if $line =~ m/Inventory/;
    # MORE TO DO
    next if $line =~ m/Scanned photos/;
    $line =~ s/\n//;    # just for a tidy list when testing
    my @split = split( /\/([^\/]+)$/, $line );    # separate filename from rest of path
    foreach (@split) {
        push( my @filenames, "$_" );
        # print "@filenames\n";    # check content of array
        my %dupes;
        foreach my $item (@filenames) {
            next unless $dupes{$item}++;
            print "$item\n";
        }
    }
}
I am trying to figure out what is wrong with my check for duplicates. I know the array contains duplicates (uncommenting the first print statement gives me a list with a lot of duplicates), but the code as it stands prints nothing.
This isn't the main purpose of my post, but my ultimate goal is to remove unique filenames from the list and keep only filenames that appear in more than one directory. None of these files are identical, but there are many different versions of the same file, so I am matching on the filename alone.
For example, I need an input:
~/Pictures/2010/12345678.jpg
~/Pictures/2010/12341234.jpg
~/Desktop/temperature/12345678.jpg
to get the output:
~/Pictures/2010/12345678.jpg
~/Desktop/temperature/12345678.jpg
So I suppose it would ideally be better to check for duplicates with a regex match, without splitting, if that is possible.
The loop below does nothing useful, because the hash and the array only ever contain one value on each iteration of the loop:
foreach (@split) {
    push( my @filenames, "$_" );    # add one element to a fresh lexical array
    my %dupes;
    foreach my $item (@filenames) { # loops exactly once
        next unless $dupes{$item}++;    # add one key to a fresh lexical hash
        print "$item\n";
    }
}    # @filenames and %dupes go out of scope here
A lexical variable (declared with my) has a scope that extends to the end of the enclosing block { ... }, in this case your foreach loop. When the variables go out of scope, they are destroyed and all their data is lost.
I don't know why you are copying the filenames from @split to @filenames; it seems redundant. For deduplication, this would be:
my %seen;
my @uniq;
@uniq = grep !$seen{$_}++, @split;
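A minimal, runnable sketch of that idiom, with made-up sample data in place of your @split array:

```perl
use strict;
use warnings;

# stand-in for the filenames your split produces
my @split = qw(12345678.jpg 12341234.jpg 12345678.jpg);

# keep only the first occurrence of each name;
# %seen counts occurrences as a side effect
my %seen;
my @uniq = grep { !$seen{$_}++ } @split;

print "$_\n" for @uniq;    # prints 12345678.jpg then 12341234.jpg
```

The grep keeps an element only when its counter in %seen is still zero, so order of first appearance is preserved.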
Additional Information:
You may also be interested in using File::Basename to get the filename:
use File::Basename;
my $fullpath = "~/Pictures/2010/12345678.jpg";
my $name = basename($fullpath); # 12345678.jpg
Your substitution
$line =~ s/\n//;
should probably be
chomp($line);
When you read from a file handle with for (or foreach), you read all the lines at once and store them in memory. In most cases it is preferable to read line by line with while, for example:
while (my $line = <$fh>)
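Putting these pieces together, a sketch of how the whole script might look; the skip patterns come from your script, and the __DATA__ section stands in for your input file so the example is self-contained:

```perl
use strict;
use warnings;
use File::Basename;

# count how many times each bare filename occurs
my %count;
while (my $line = <DATA>) {
    chomp $line;
    next if $line =~ /Inventory/;
    next if $line =~ /Scanned photos/;
    $count{ basename($line) }++;
}

# print only names that appear in more than one path
for my $name (sort keys %count) {
    print "$name\n" if $count{$name} > 1;
}

__DATA__
~/Pictures/2010/12345678.jpg
~/Pictures/2010/12341234.jpg
~/Desktop/temperature/12345678.jpg
```

With the sample paths above this prints 12345678.jpg, which matches the desired behaviour of keeping only duplicated filenames.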
TLP's answer provides a lot of good advice. In addition:
Why use both an array and a hash to store the filenames? Just use the hash as the only storage and you remove duplicates automatically, i.e.:
my %filenames;    # outside of the loops
...
foreach (@split) {
    $filenames{$_}++;
}
Now, when you want the list of unique filenames, just use keys %filenames, or sort keys %filenames if you want them in alphabetical order. And the value of each hash key is its number of occurrences, so you can find out which names were duplicated, if you care.