Php: speed up levensthein comparison, 10k + records

in my mysql table I have a field name that is unique. However, the contents of the field are collected in different places. So, maybe I have 2 entries with a very similar name, not the second one that is being called due to spelling errors.

Now I want to find those records that are very similar to others. To do this, I go through all of my entries and compare the name with other entries, repeating all entries again. The problem is that there are over 15 thousand records that are taking too long. Is there a way to make it faster?

this is my code:

for($x=0;$x<count($serie1);$x++)
    {
    for($y=0;$y<count($serie2);$y++)
        {
        $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
        if($sim==1)
            print("{$A[$x]['naam']} --> {$B[$y]['naam']} = {$sim}<br>");
        }
     }`

      

+3


source to share


2 answers


Preamble: Such a task will always be labor intensive and there will always be fumes slipping through. However, a few ideas:

1.in fact, the algorithm could improve (slightly)

assuming $series1

u $series2

have the same values ​​in the same order, you don't have to iterate over the entire second array in the inner loop every loop. In this case, you only need to evaluate each pair of values ​​once - levenshtein('a', 'b')

enough, you also don't need to levenshtein('b', 'a')

(and you don't need to either levenstein('a', 'a')

)

under these assumptions, you can write your function like this:

for($x=0;$x<count($serie1);$x++)
{
   for($y=$x+1;$y<count($serie2);$y++) // <-- $y doesn't need to start at 0
    {
      $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
      if($sim==1)
        print("{$A[$x]['naam']} --> {$B[$y]['naam']} = {$sim}<br>");
    }
 }

      

2.possibly MySQL is faster

there are examples on the net for implementing levenshtein () as a MySQL function. See example on SO here: How to add levenshtein function to mysql?



If you're comfortable with complex (ish) SQL, you can delegate the heavy lifting to MySQL and at least get some performance, because you're not collecting whole 16k rows at runtime PHP.

3.do not do everything at once / save your results

of course, you have to run the function once for every entry, but after the first run, you only need to check for new entries since the last run. Schedule a chronjob that once every day / week / month .. checks for all new entries. You will need a column inserted_at

in your table and still need to compare new names with every other record.

3.5 do some work onInsert

a) if the expectation is acceptable, do a validation as soon as a new entry is inserted so that you either write it to the log or give direct user feedback. (Tangent: this could be a good use case for an async task queue like http://gearman.org/ -> start a new background check process, return success message to insert right away)

b) PHP has two other functions to help you find strings that are almost similar: metaphone () and soundex () . These functions generate abstract hashes that represent how the string will sound when spoken. You can generate (one or both) these hashes for each insert, store them as a separate field in your table, and use simple SQL functions to find records with similar hashes

+2


source


The problem with levenshtein is that it only compares string a to string b. I built a spelling corrector where it inserts all lines in a big trick and it works like a dictionary. It will then search for any string b in that dictionary, finding all the nearest matching words. I did it first in Fortran (!), Then in Pascal. This would be easiest in a more modern language, but I suspect php won't make things easier. Look at here.



0


source







All Articles