Perl - how can I match strings that are not quite the same?

I have a list of strings that I want to find in a file. This would be easy if the lines in my list and in the file matched exactly. Unfortunately, the names contain typos and variations. Here is an example of how some of these lines differ:

List          File
B-Arrestin    Beta-Arrestin
Becn-1        BECN 1
CRM-E4        CRME4

      

Note that each of these pairs must be considered a match, even though the strings differ. I know that I could categorize all the variations and write a separate regex for each one, but that is cumbersome enough that I might be better off looking for matches manually. I think the best solution for my problem would be some kind of expression that reads:

"Match this string exactly, but still treat it as a match if there are X characters that don't match."

Is there something like this? Is there another way to match strings that are not exactly the same but close?



2 answers


As 200_success pointed out, you can perform fuzzy matching with Text::Fuzzy, which calculates the Levenshtein distance between two strings. You will need to experiment with the maximum Levenshtein distance you are willing to accept, but if you compare case-insensitively, the largest distance in your sample data is three:

use strict;
use warnings;
use 5.010;

use Text::Fuzzy;

my $max_dist = 3;

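while (<DATA>) {
    chomp;
    next unless /\S/;    # skip blank lines

    # Split on the first run of whitespace only, so 'BECN 1' stays intact
    my ($string1, $string2) = split ' ', $_, 2;

    # Lowercase both sides so that case differences do not count as edits
    my $tf = Text::Fuzzy->new(lc $string1);
    say "'$string1' matches '$string2'" if $tf->distance(lc $string2) <= $max_dist;
}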

__DATA__
B-Arrestin    Beta-Arrestin
Becn-1        BECN 1
CRM-E4        CRME4

Output:

'B-Arrestin' matches 'Beta-Arrestin'
'Becn-1' matches 'BECN 1'
'CRM-E4' matches 'CRME4'
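
To make the distances concrete, and to show how this could be applied to a list and a file rather than to ready-made pairs, here is a minimal pure-Perl sketch. It is not how Text::Fuzzy is implemented; the levenshtein helper and the @list and @file arrays are hypothetical names chosen for illustration, and @file stands in for lines you would read from your real file. For each list entry it picks the closest candidate and accepts it only if the distance is within $max_dist.

```perl
use strict;
use warnings;
use 5.010;
use List::Util qw(min);

# Hypothetical helper: textbook dynamic-programming Levenshtein distance.
# Only the previous row of the table is kept in memory.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Distance from the empty prefix of $s to each prefix of $t
    my @prev = (0 .. scalar(@t));

    for my $i (1 .. scalar(@s)) {
        my @cur = ($i);    # distance from a prefix of $s to the empty string
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # delete a character from $s
                $cur[$j - 1] + 1,         # insert a character into $s
                $prev[$j - 1] + $cost,    # substitute, or keep on a match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Assumed sample data, taken from the question
my @list = ('B-Arrestin', 'Becn-1', 'CRM-E4');
my @file = ('Beta-Arrestin', 'BECN 1', 'CRME4');
my $max_dist = 3;

for my $name (@list) {
    my ($best, $best_dist);

    # Find the candidate with the smallest case-insensitive distance
    for my $candidate (@file) {
        my $d = levenshtein(lc $name, lc $candidate);
        ($best, $best_dist) = ($candidate, $d)
            if !defined $best_dist || $d < $best_dist;
    }

    if (defined $best_dist && $best_dist <= $max_dist) {
        say "'$name' matches '$best' (distance $best_dist)";
    }
    else {
        say "'$name' has no match within distance $max_dist";
    }
}
```

On the sample data the three pairs come out at distances 3, 1 and 1. Picking the nearest candidate instead of accepting anything under the threshold helps when several names in the file are close to each other, but a fixed threshold of three is generous for short names, so you may want to scale it with the length of the string.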

      
