Optimizing python code to find duplicate strings

Question

Optimizing python code to find duplicate strings

We have a long list with lines (about 18 thousand records). The goal is to find all similar lines and group them according to the maximum similarity. ("a" is a list with strings)

I wrote the following code:

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

dupl = {}

while len(a) > 0:
    k = a.pop()
    if k not in dupl.keys():
        dupl[k] = []
    for i,j in enumerate(a):
            dif = diff(k, j)
            if dif > 0.5:
                dupl[k].append("{0}: {1}".format(dif, j))

This code takes an element from the list and searches for duplicates in the rest of the list. If the similarity is greater than 0.5, a similar string is added to the dict.

Everything works well, but very, very slow due to the length of the "a" list. So I would like to ask if there is a way to somehow optimize this code? Any ideas?

+3

optimization python text search

annndrey 23 Mar 12 at 17:44

source to share

2 answers

If necessary , iterate over many itertools elements to save yourself !

This snippet will move all the possibilities of your string (permutation) and return them according to your original code. I feel like this not in

is a uselessly expensive way to test, and not like a Python one. Permutations have been chosen as this will give you the most access to check a-> b or b-> a for two given strings.

import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios(strings):
     dupl = dict()
     for s, t in itertools.permutations(strings, 2):
          try:
               dupl[s].append({t: diff(s,t)})
          except KeyError:
               dupl[s] = []
               dupl[s].append({t: diff(s,t)})
     return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print calculate_ratios(a)

Depending on your constraints (since the permutations are computationally and spatially redundant), you can replace the permutations with combinations, but then your accessor must be adjusted (since ab will only appear in [b], not b [a]).

I use quick_ratio () in the code, but it just changed to ratio () or real_quick_ratio () depending on your solution, if there is sufficient precision.

And in this case, a simple IF will solve this problem:

import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()
def diff2(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def calculate_ratios(strings, threshold):
     dupl = dict()
     for s, t in itertools.permutations(strings, 2):
          if diff(s,t) > threshold: #arbitrary threshhold
               try:
                    dupl[s].append({t: diff2(s,t)})
               except KeyError:
                    dupl[s] = []
                    dupl[s].append({t: diff2(s,t)})
     return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print calculate_ratios(a, 0.5)

+2

hexparrot 23 Mar 12 at 18:21

source to share

Peter de Rivaz · Accepted Answer · 2012-03-23T20:03:12+0000

A few small optimizations:

You can remove duplicates from the list before starting the search (e.g. a = list (set (a))). So far, if a contains 18k copies of the "hello" line, it will call diff 18k * 18k times.
In short, you will be comparing the number of strings i with string number j, and also string j with string number i. I think they will return the same result so you can only compute one of them and maybe go twice as fast.

Of course, the main problem is that diff calls n * n times for a list of length n, and the ideal solution would be to reduce the number of times diff calls. The approach to use will depend on the content of your strings.

Here are some examples of possible approaches that will be relevant to different cases:

Let's assume the strings are of different lengths. diff will return> 0.5 if line lengths are within 2. In this case, you can sort the input lines by length in O (nlogn) time, and then only compare lines with the same length.
Suppose the strings have word sequences and are expected to be either very different or very similar. You can build an inverted index for words and then only compare strings containing the same unusual words
Let's say you expect rows to fall into a small number of groups. You can try running the K-Means algorithm to group them into clusters. This will take K * n * I, where i is the number of iterations of the K-Means algorithm you choose to use.

If n gets very large (many millions), then this is inappropriate and you may have to use more approximate methods. One example that is used to cluster web pages is called MinHash

Optimizing python code to find duplicate strings

More articles: