Python String Comparison Precision

Question


I am trying to compare two lists of data that contain free text denoting the same objects. Example:

List 1 ['abc LLC','xyz, LLC']
List 2 ['abc , LLC','xyz LLC']

This is a simple example, but the problem is that there could be many variations, such as a change of case or an added "." in between. Is there any Python package out there that can do the comparison and give a similarity score?

+3

python

Raman Narayanan 04 Apr 12 at 7:48


2 answers

There is an excellent compiled (C extension) library that uses the Levenshtein distance (edit distance) between strings to evaluate similarity. Try:

https://github.com/miohtama/python-Levenshtein

+3

Not_a_Golfer 04 Apr 12 at 8:21


AKX · Accepted Answer · 2012-04-04T07:54:38+0000

You can use an implementation of the Levenshtein distance algorithm for inexact string matching, such as the one from Wikibooks.
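As a rough illustration of what such an implementation looks like, here is a minimal pure-Python sketch of the Levenshtein distance using the usual two-row dynamic programming table. It is not the Wikibooks code. The `similarity` helper and its normalization by the longer string's length are assumptions added here to turn the distance into a score between 0 and 1.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: scale the distance to a score in [0, 1], 1.0 meaning identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein('abc LLC', 'abc , LLC'))
print(similarity('abc LLC', 'abc , LLC'))
```

Which score counts as "the same object" is a threshold you would have to pick from your own data.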

Another option is to normalize the strings before the actual comparison, for example by converting everything to lowercase and removing spaces and punctuation. Whether this is appropriate depends on your use case:

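import string
import unicodedata

# Characters kept after folding: ASCII letters and digits only.
allowed = set(string.ascii_letters + string.digits)

def fold(s):
    # Lowercase, decompose accented characters, and drop anything non-ASCII.
    s = unicodedata.normalize("NFKD", s.lower()).encode("ascii", "ignore").decode("ascii")
    # Remove spaces, punctuation, and every other character not in `allowed`.
    return "".join(c for c in s if c in allowed)

for example in ['abc LLC', 'xyz, LLC', 'abc , LLC', 'xyz LLC']:
    print("%r -> %r" % (example, fold(example)))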

will print

'abc LLC' -> 'abcllc'
'xyz, LLC' -> 'xyzllc'
'abc , LLC' -> 'abcllc'
'xyz LLC' -> 'xyzllc'
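To apply this folding to the two lists from the question, one possible approach is to index the second list by folded key and look up each entry of the first list. The sketch below repeats a Python 3 version of `fold` so that it runs on its own. `match_lists` is a hypothetical helper name, and exact equality of the folded strings is assumed to be a good enough test for "same object".

```python
import string
import unicodedata

ALLOWED = set(string.ascii_letters + string.digits)


def fold(s):
    # Lowercase, strip accents, then keep only ASCII letters and digits.
    s = unicodedata.normalize("NFKD", s.lower()).encode("ascii", "ignore").decode("ascii")
    return "".join(c for c in s if c in ALLOWED)


def match_lists(list1, list2):
    """Hypothetical helper: map each entry of list1 to the entries of list2 with the same folded form."""
    by_key = {}
    for name in list2:
        by_key.setdefault(fold(name), []).append(name)
    return {name: by_key.get(fold(name), []) for name in list1}


list1 = ['abc LLC', 'xyz, LLC']
list2 = ['abc , LLC', 'xyz LLC']
matches = match_lists(list1, list2)
for name, found in matches.items():
    print("%r matches %r" % (name, found))
```

Entries that differ by more than case, spacing, or punctuation (a typo, say) will not match this way; for those, an edit-distance score on the folded strings is the more forgiving option.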
