Python String Comparison Precision

I am trying to compare two lists of data that contain free-text entries denoting the same objects. Example:

List 1 ['abc LLC','xyz, LLC']
List 2 ['abc , LLC','xyz LLC']

      

This is a simple example, but there could be many variations, such as a change of case or an added "." in between. Is there a Python package that can do the comparison and give a similarity score?



2 answers


You can use an implementation of the Levenshtein distance algorithm for inexact string matching, like the one from Wikibooks.
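
To make the idea concrete, here is a minimal pure-Python sketch of the edit-distance calculation. It is not the Wikibooks code; the names `levenshtein` and `similarity` are chosen here for illustration, and scaling the distance by the longer string's length is just one assumed way to turn it into a score.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed scoring: 1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For the strings in the question, `levenshtein('abc LLC', 'abc , LLC')` is 2 (the added space and comma), so the assumed score is 1 - 2/9, roughly 0.78. You would still have to pick a threshold that suits your data.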

Another option is to normalize the strings before a plain comparison, for example by converting everything to lowercase and removing spaces and punctuation. Whether this is appropriate depends on your use case:

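import string
import unicodedata

# Characters kept after folding: ASCII letters and digits only.
allowed = set(string.ascii_letters + string.digits)

def fold(s):
    # Lowercase, decompose accented characters, and drop anything non-ASCII.
    s = unicodedata.normalize("NFKD", str(s).lower())
    s = s.encode("ascii", "ignore").decode("ascii")
    # Strip spaces, punctuation, and any other character not in `allowed`.
    return "".join(c for c in s if c in allowed)

for example in ['abc LLC', 'xyz, LLC', 'abc , LLC', 'xyz LLC']:
    print("%r -> %r" % (example, fold(example)))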

      



This prints:

'abc LLC' -> 'abcllc'
'xyz, LLC' -> 'xyzllc'
'abc , LLC' -> 'abcllc'
'xyz LLC' -> 'xyzllc'
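
To apply this to the two lists from the question, one possible approach is to index one list by its folded key and look up the entries of the other. The sketch below is written for Python 3; `match_lists` is a hypothetical helper name, and `fold` is repeated so that the snippet runs on its own.

```python
import string
import unicodedata

ALLOWED = set(string.ascii_letters + string.digits)


def fold(s):
    """Lowercase s, strip accents, and keep only ASCII letters and digits."""
    s = unicodedata.normalize("NFKD", str(s).lower())
    s = s.encode("ascii", "ignore").decode("ascii")
    return "".join(c for c in s if c in ALLOWED)


def match_lists(left, right):
    """Pair each item in left with the item in right that folds to the same key.

    Items without a counterpart are paired with None.
    """
    by_key = {fold(item): item for item in right}
    return [(item, by_key.get(fold(item))) for item in left]


list1 = ['abc LLC', 'xyz, LLC']
list2 = ['abc , LLC', 'xyz LLC']
pairs = match_lists(list1, list2)
for original, match in pairs:
    print("%r matches %r" % (original, match))
```

This only finds entries that become identical after folding. Entries that still differ, such as those with typos, would need a distance-based score instead.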

      



There is an excellent binary library that uses the Levenshtein distance (edit distance) between strings to evaluate similarity. Try:



https://github.com/miohtama/python-Levenshtein







