How do I recognize the word "text" through a regular expression?

Which perl regular expression matches the "word" in the following filename?

I have a series of filenames where multiple words appear more than once:

john_smith_on_alaska_trip_john_smith_0001.jpg

      

His wife's name is Olga, with an umlaut over o, and there are several other names with diacritics; everything in lower case, in my situation, but not just english az..jpg has been temporarily disabled for other reasons and may be ignored for this discussion.

I want to remove duplicate names / words. Something like this works great in emacs:

s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/

      

Run it once, and above: john_smith_on_alaska_trip__smith_0001.jpg

Again: john_smith_on_alaska_trip___0001.jpg

This doesn't work in Perl because it \w

contains _

words as a character. Worse, the anchor \b

is something other than these symbols and is therefore not separated from _

.

My current solution is to replace everything _

with

, take action, and return. But it seems like such a fundamental requirement, I feel like I'm missing something.

Thank.

+3


source to share


2 answers


Use Character Class\p{Alpha}

and Lookbehind and Lookahead Assertions instead of word boundaries to ensure that each word is a whole word instead of a substring:

use strict;
use warnings;

my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";

1 while $file =~ s{
    (?<!\p{Alpha}) ( \p{Alpha}++ )     # Word surrounded by non-word chars
    .* \K                              # Keep everything before this point
    (?<!\p{Alpha}) \1 (?!\p{Alpha})    # Strip duplicate word 
}{}x;

print "$file\n";

      

Outputs:



john_smith_on_alaska_trip___0001_johnsmith.jpg

      

Live Demo

+2


source


You can use split

to split your string into its component parts and then check for duplicates using a hash:

use strict;
use warnings;

my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;

my %count;
foreach my $word (@words) {
    $word = '' if ++$count{$word} > 1;
}

print join('_', @words), "\n";

      

Output:

john_smith_on_alaska_trip___0001.jpg

      



Alternatively, you can use uniq

from List::MoreUtils

to get unique words, although this will slightly change your result by eliminating consecutive underscores after trip

:

use strict;
use warnings;

use List::MoreUtils 'uniq';

my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;

print join('_', uniq @words), "\n";

      

Output:

john_smith_on_alaska_trip_0001.jpg

      

0


source







All Articles