Cropping Levenshtein Distance

Is there a good way to use Levenstein distance to match one specific string to any case within a second longer string?

Example:

str1='aaaaa'
str2='bbbbbbaabaabbbb'

if str1 in str2 with a distance < 2:
    return True

      

So, in the above example, the part of string 2 is aabaa

and distance(str1,str2) < 2

, so the operator should return True

.

The only way I can do this is to take 5 characters from str2 at a time, compare that to str1, and then repeat that traversal over str2. Unfortunately this seems really inefficient and I need to process a lot of data this way.

+3


source to share


3 answers


The trick is to generate all substrings of the corresponding length b

and then compare them.



def lev_dist(a,b):
    length_cost = abs(len(a) - len(b))
    diff_cost = sum(1 for (aa, bb) in zip(a,b) if aa != bb)
    return diff_cost + length_cost

def all_substr_of_length(n, s):
    if n > len(s):
        return [s]
    else:
        return [s[i:i+n] for i in range(0, len(s)-n+1)]

def lev_substr(a, b):
    """Gives minimum lev distance of all substrings of b and
    the single string a.
    """

    return min(lev_dist(a, bb) for bb in all_substr_of_length(len(a), b))

if lev_substr(str1, str2) < 2:
    # it works!

      

+2


source


Maybe you look at nofollow noreferrer -> which supports fuzzy matching:

>>> import regex
>>> regex.search("(aaaaa){s<2}", 'bbbbbbaabaabbbb')
<regex.Match object; span=(6, 11), match='aabaa', fuzzy_counts=(1, 0, 0)>

      



Since you're looking for strings of the same length, you can also do aa Hamming distance , which is probably much faster than Levenshtein's distance by the same two strings:

str1='aaaaa'
str2='bbbbbbaabaabbbb'
for s in [str2[i:i+len(str1)] for i in range(0,len(str2)-len(str1)+1)]:
    if sum(a!=b for a,b in zip(str1,s))<2:
        print s    # prints 'aabaa'

      

+3


source


The trick is usually to play with insertion (for shorter ones) or remove (for longer) costs. You might also consider using Damerau-Levenshtein instead. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

0


source







All Articles