Algorithm for fuzzy conjugation of strings from two lists
Suppose you have two lists of strings containing similar items, with changes (for example, List 1: Apples, fruit_b, orange; List2: Fruit, apples, banana, orange_juice).
Given a distance metric such as Levenshtein distance, what are good algorithms for finding the optimal pairing, that is, a pairing that minimizes the sum of the distances for all pairings?
The result corresponding to my example would look like this:
Apples - apples
fruits_b - Fruit
orange - orange_juice
- banana
Sub question: is there any tool that already implements this or something similar?
+3
source to share
1 answer
OK, here's my python solution using Levenshtein distance and Hungarian algorithm (both provided by external packages):
from munkres import Munkres
from Levenshtein import distance
from sys import argv
if __name__ == '__main__':
if len(argv) < 3:
print("Usage: fuzzy_match.py file file")
print("Finds the best pairing of lines from the two input files")
print("using the Levenshtein distance and the Hungarian algorithm")
w1 = [l.strip() for l in open(argv[1]).readlines()]
w2 = [l.strip() for l in open(argv[2]).readlines()]
if len(w1) != len(w2):
if len(w2) > len(w1):
w1, w2 = w2, w1
w2.extend([""]*(len(w1)-len(w2)))
matrix = []
for i in w1:
row = []
for j in w2:
row.append(distance(i.lower(), j.lower()))
matrix.append(row)
m = Munkres()
max_length = max(len(w) for w in w1)
for i, j in m.compute(matrix):
print(("{:<%d}{}" % (max_length+10)).format(w1[i], w2[j]))
It works very nicely. I'm still curious if anyone can come up with a better algorithm!
+3
source to share