Check for typos comparing two strings in T-SQL

Question

We have developed a series of business rules that define a duplicate contact record. The core of these rules is to first check for the same name and then compare other fields such as phone number and email.

The problem is that only a small percentage of duplicate records are being caught and automatically flagged / merged.

In order to catch more duplicates, I would like to check for typos in the contact's name (e.g. Michael = Micheal).

Is there a neat function I can use to check for typos and get more accurate results? I would have thought that a function that detects a one-character difference between two strings would do the trick.

+3

tsql pattern-matching data-scrubbing

Benzine Feb 19 '13 at 3:34

1 answer

mjv · Accepted Answer · 2013-02-19T04:26:44+0000

Keep in mind that most algorithms for measuring string similarity are computationally intensive and, depending on the size of the job being performed, T-SQL can be a poor choice in terms of performance.

Instead of measuring similarity between strings, consider hash functions, in particular those that preserve the basic "structure" of words. The advantage of hash codes is that they are computed only once per string, using only that string as input, and can then be used in T-SQL filters with a simple equality predicate (as opposed to similarity measures, which require running the algorithm against every possible reference string). A plausible hash code to suggest is SOUNDEX, which is particularly well suited to typical variations in personal and company names, and which is also implemented "natively" as a T-SQL function.
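
As a rough sketch of this idea (the temp table and its column names are hypothetical, not taken from the question), the codes could be stored as persisted computed columns so that SOUNDEX runs once per row, and candidate duplicates are then found with plain equality:

```sql
-- Both spellings produce the same Soundex code.
SELECT SOUNDEX('Michael') AS Code1, SOUNDEX('Micheal') AS Code2;  -- M240, M240

-- Hypothetical contact table, for illustration only.
CREATE TABLE #Contacts (
    ContactId INT PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName  VARCHAR(50) NOT NULL,
    -- Computed once when the row is written, then stored.
    FirstNameSoundex AS SOUNDEX(FirstName) PERSISTED,
    LastNameSoundex  AS SOUNDEX(LastName)  PERSISTED
);

INSERT INTO #Contacts (ContactId, FirstName, LastName)
VALUES (1, 'Michael', 'Smith'),
       (2, 'Micheal', 'Smith'),
       (3, 'Mary',    'Jones');

-- Candidate duplicates: same codes for both name parts.
-- a.ContactId < b.ContactId lists each pair only once.
SELECT a.ContactId AS ContactA, b.ContactId AS ContactB
FROM #Contacts AS a
JOIN #Contacts AS b
  ON  a.ContactId < b.ContactId
  AND a.FirstNameSoundex = b.FirstNameSoundex
  AND a.LastNameSoundex  = b.LastNameSoundex;   -- returns the pair (1, 2)

DROP TABLE #Contacts;
```

Soundex is coarse, so matches are best treated as candidates to be confirmed by the other business rules (phone, email). It also keeps the first letter of the word as-is, so a typo in the first letter will not be caught this way.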

It would probably be preferable to compute the Soundex code for every single word in the name field, producing for example two codes for an input such as "Charles Darwin", three for "Jean Jacques Rousseau", etc. To improve performance, you may need to find a way to distinguish the last name from the given name(s) in order to simplify the filter condition.
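
A minimal sketch of the per-word approach, assuming SQL Server 2016 or later (STRING_SPLIT requires database compatibility level 130) and a hypothetical single FullName column:

```sql
-- Hypothetical data: the whole name is stored in one field.
DECLARE @Names TABLE (
    ContactId INT PRIMARY KEY,
    FullName  VARCHAR(100) NOT NULL
);

INSERT INTO @Names (ContactId, FullName)
VALUES (1, 'Charles Darwin'),
       (2, 'Jean Jacques Rousseau');

-- One output row per word: two rows for contact 1, three for contact 2.
SELECT n.ContactId,
       s.value          AS NamePart,
       SOUNDEX(s.value) AS NamePartSoundex
FROM @Names AS n
CROSS APPLY STRING_SPLIT(n.FullName, ' ') AS s
WHERE s.value <> '';   -- skip empty pieces caused by repeated spaces
```

In practice these rows would probably be written to a side table keyed by contact and code, so that the codes are computed once and the duplicate check becomes an equality join on that table.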

If you prefer to work with string similarity methods, I've found that either the Levenshtein distance or the Ratcliff/Obershelp measure works pretty well for catching small changes like typos. As with Soundex, you may still want to process each word separately. This introduces the complexity of dealing with multiple values for a given name entry, but it also allows more robust handling of the common situation where some records store the first name followed by the surname and others store them in reverse order (or where parts of the name are omitted or abbreviated).
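
SQL Server versions of that era have no built-in Levenshtein function, so it has to be written by hand. The following is only a sketch (dbo.LevenshteinDistance is a hypothetical name, and a row-by-row scalar function like this will be slow on large tables, as cautioned above):

```sql
-- Hypothetical helper. Characters are compared with the database's
-- default collation, so case sensitivity depends on that collation.
CREATE FUNCTION dbo.LevenshteinDistance (@s NVARCHAR(255), @t NVARCHAR(255))
RETURNS INT
AS
BEGIN
    IF @s IS NULL OR @t IS NULL RETURN NULL;

    DECLARE @sLen INT = LEN(@s), @tLen INT = LEN(@t);
    DECLARE @i INT = 1, @j INT = 0;
    DECLARE @cost INT, @up INT, @left INT, @diag INT, @val INT;

    -- One row of the edit-distance matrix: d at column j is the distance
    -- between the current prefix of @s and the first j characters of @t.
    DECLARE @row TABLE (j INT PRIMARY KEY, d INT NOT NULL);

    -- Row 0: turning an empty prefix into j characters takes j insertions.
    WHILE @j <= @tLen
    BEGIN
        INSERT INTO @row (j, d) VALUES (@j, @j);
        SET @j += 1;
    END;

    WHILE @i <= @sLen
    BEGIN
        SET @diag = @i - 1;                       -- previous row, column 0
        UPDATE @row SET d = @i WHERE j = 0;       -- current row, column 0
        SET @left = @i;
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            SELECT @up = d FROM @row WHERE j = @j;    -- previous row, column j
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;
            SET @val = @diag + @cost;                     -- substitution or match
            IF @up + 1   < @val SET @val = @up + 1;       -- deletion
            IF @left + 1 < @val SET @val = @left + 1;     -- insertion
            UPDATE @row SET d = @val WHERE j = @j;
            SET @diag = @up;
            SET @left = @val;
            SET @j += 1;
        END;
        SET @i += 1;
    END;

    SELECT @val = d FROM @row WHERE j = @tLen;
    RETURN @val;
END;
GO  -- batch separator understood by SSMS and sqlcmd

SELECT dbo.LevenshteinDistance(N'Michael', N'Micheal') AS Distance;  -- 2
```

Note that the example from the question scores 2 rather than 1: plain Levenshtein counts swapped adjacent letters as two substitutions. A threshold of "distance <= 1" would therefore miss this kind of typo, so the cutoff should be chosen with transpositions in mind.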
