How do I recognize the word "text" through a regular expression?
Which perl regular expression matches the "word" in the following filename?
I have a series of filenames where multiple words appear more than once:
john_smith_on_alaska_trip_john_smith_0001.jpg
His wife's name is Olga, with an umlaut over o, and there are several other names with diacritics; everything in lower case, in my situation, but not just english az..jpg has been temporarily disabled for other reasons and may be ignored for this discussion.
I want to remove duplicate names / words. Something like this works great in emacs:
s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/
Run it once, and above: john_smith_on_alaska_trip__smith_0001.jpg
Again: john_smith_on_alaska_trip___0001.jpg
This doesn't work in Perl because it \w
contains _
words as a character. Worse, the anchor \b
is something other than these symbols and is therefore not separated from _
.
My current solution is to replace everything _
with
, take action, and return. But it seems like such a fundamental requirement, I feel like I'm missing something.
Thank.
source to share
Use Character Class\p{Alpha}
and Lookbehind and Lookahead Assertions instead of word boundaries to ensure that each word is a whole word instead of a substring:
use strict;
use warnings;
my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";
1 while $file =~ s{
(?<!\p{Alpha}) ( \p{Alpha}++ ) # Word surrounded by non-word chars
.* \K # Keep everything before this point
(?<!\p{Alpha}) \1 (?!\p{Alpha}) # Strip duplicate word
}{}x;
print "$file\n";
Outputs:
john_smith_on_alaska_trip___0001_johnsmith.jpg
source to share
You can use split
to split your string into its component parts and then check for duplicates using a hash:
use strict;
use warnings;
my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;
my %count;
foreach my $word (@words) {
$word = '' if ++$count{$word} > 1;
}
print join('_', @words), "\n";
Output:
john_smith_on_alaska_trip___0001.jpg
Alternatively, you can use uniq
from List::MoreUtils
to get unique words, although this will slightly change your result by eliminating consecutive underscores after trip
:
use strict;
use warnings;
use List::MoreUtils 'uniq';
my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;
print join('_', uniq @words), "\n";
Output:
john_smith_on_alaska_trip_0001.jpg
source to share