Algorithm for fuzzy conjugation of strings from two lists

Question

Algorithm for fuzzy conjugation of strings from two lists

Suppose you have two lists of strings containing similar items, with changes (for example, List 1: Apples, fruit_b, orange; List2: Fruit, apples, banana, orange_juice).

Given a distance metric such as Levenshtein distance, what are good algorithms for finding the optimal pairing, that is, a pairing that minimizes the sum of the distances for all pairings?

The result corresponding to my example would look like this:

Apples    - apples
fruits_b  - Fruit
orange    - orange_juice
          - banana

Sub question: is there any tool that already implements this or something similar?

+3

string algorithm levenshtein-distance

static_rtti 18 nov. '14 at 9:25

source to share

1 answer

static_rtti · Accepted Answer · 2014-11-18T11:27:38+0000

OK, here's my python solution using Levenshtein distance and Hungarian algorithm (both provided by external packages):

from munkres import Munkres
from Levenshtein import distance
from sys import argv

if __name__ == '__main__':
    if len(argv) < 3:
        print("Usage: fuzzy_match.py file file")
        print("Finds the best pairing of lines from the two input files")
        print("using the Levenshtein distance and the Hungarian algorithm")
    w1 = [l.strip() for l in open(argv[1]).readlines()]
    w2 = [l.strip() for l in open(argv[2]).readlines()]
    if len(w1) != len(w2):
        if len(w2) > len(w1):
            w1, w2 = w2, w1
        w2.extend([""]*(len(w1)-len(w2)))
    matrix = []
    for i in w1: 
        row = []
        for j in w2:
            row.append(distance(i.lower(), j.lower()))
        matrix.append(row)
    m = Munkres()
    max_length = max(len(w) for w in w1)
    for i, j in m.compute(matrix):
        print(("{:<%d}{}" % (max_length+10)).format(w1[i], w2[j]))

It works very nicely. I'm still curious if anyone can come up with a better algorithm!

Algorithm for fuzzy conjugation of strings from two lists

More articles: