Python String Comparison Precision
I am trying to compare two lists of data in which free-text entries denote the same object. For example:
List 1 ['abc LLC','xyz, LLC']
List 2 ['abc , LLC','xyz LLC']
This is a simple example, but there could be many other variations, such as changes in case or an extra "." in the middle. Is there a Python package that can do the comparison and give a similarity score?
You can use an implementation of the Levenshtein distance algorithm for inexact string matching, like this one from Wikibooks.
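To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the usual two-row dynamic programming approach. It is not the Wikibooks code itself, and the `similarity` helper (with its scaling by the longer string's length) is a hypothetical addition, just one common way to turn the distance into a score.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: scale the distance to 0.0-1.0 (1.0 = identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With the strings from the question, `levenshtein('abc LLC', 'abc , LLC')` is 2 (two inserted characters), so the score is 1 - 2/9. You would still need to pick a threshold that suits your data.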
Another option is to normalize the strings before doing the original comparison, for example by converting everything to lowercase and removing spaces and punctuation. Whether this is enough depends on your use case:
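import string, unicodedata

# Characters that survive folding: ASCII letters and digits only.
allowed = string.ascii_letters + string.digits

def fold(s):
    # Lowercase, decompose accented characters (NFKD), and drop anything non-ASCII.
    s = unicodedata.normalize("NFKD", str(s).lower()).encode("ascii", "ignore").decode("ascii")
    # Strip spaces, punctuation, and any other character not in `allowed`.
    s = "".join(c for c in s if c in allowed)
    return s

for example in ['abc LLC', 'xyz, LLC', 'abc , LLC', 'xyz LLC']:
    print("%r -> %r" % (example, fold(example)))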
This prints:
'abc LLC' -> 'abcllc'
'xyz, LLC' -> 'xyzllc'
'abc , LLC' -> 'abcllc'
'xyz LLC' -> 'xyzllc'
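If you also want a score after folding, one possibility is to combine the folding step with `difflib.SequenceMatcher` from the standard library. Note that this is a different similarity measure from Levenshtein distance, and `best_match` is a hypothetical helper name, so treat this as a rough sketch rather than a recommended recipe:

```python
import string
import unicodedata
from difflib import SequenceMatcher

ALLOWED = string.ascii_letters + string.digits


def fold(s):
    # Same idea as above: lowercase, strip accents, keep letters and digits.
    s = unicodedata.normalize("NFKD", str(s).lower())
    s = s.encode("ascii", "ignore").decode("ascii")
    return "".join(c for c in s if c in ALLOWED)


def best_match(name, candidates):
    """Hypothetical helper: return (candidate, score) with the highest ratio.

    Assumes `candidates` is non-empty; max() raises ValueError otherwise.
    """
    key = fold(name)
    scored = [(c, SequenceMatcher(None, key, fold(c)).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


list1 = ['abc LLC', 'xyz, LLC']
list2 = ['abc , LLC', 'xyz LLC']
matches = {name: best_match(name, list2) for name in list1}
```

For the lists in the question, each entry of `list1` pairs with its counterpart in `list2` at a ratio of 1.0, because the folded strings are identical. Entries that differ by more than case and punctuation would get a lower ratio, and you would have to decide on a cutoff yourself.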
There is an excellent binary (C extension) library that uses the Levenshtein distance (edit distance) between strings to evaluate similarity. Try:
https://github.com/miohtama/python-Levenshtein