Is it possible to compare two tables if there is no shared key between them?

I have two tables that I would like to compare for duplicates. Both tables contain just the basic information fields for a company, such as name, city, state, etc. The only possible common field I can see is the name column, but the names are not entirely consistent. Is there a way to perform a comparison between the two using the LIKE operator? I am also open to any other suggestions anyone may have.
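For context, the kind of LIKE-based comparison I have in mind looks something like the sketch below (the table and column names are just placeholders):

    -- Naive LIKE comparison: only catches cases where one name is a
    -- substring of the other, so misspellings slip through.
    SELECT a.name, b.name
    FROM   companies_a a
    JOIN   companies_b b
      ON   b.name LIKE CONCAT('%', a.name, '%')
       OR  a.name LIKE CONCAT('%', b.name, '%');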

Thanks.

+2




4 answers


I would try matching with the Double Metaphone algorithm, which is a more sophisticated SOUNDEX-style phonetic algorithm.



Here is the MySQL implementation.
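Once such a function is installed, matching on the codes is a single join. A minimal sketch, assuming the stored function is named dm() and the tables are companies_a and companies_b, each with a name column:

    -- Sketch only: dm() stands in for whatever the installed
    -- Double Metaphone function is actually called.
    SELECT a.name, b.name
    FROM   companies_a a
    JOIN   companies_b b
      ON   dm(a.name) = dm(b.name);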

+3




There are companies that make good money selling data-cleansing products that tackle exactly this kind of fuzzy matching, so it is unlikely that you can solve it with a simple (or even an extremely complex) LIKE expression.

You need something that can compare two strings and return a similarity score, where a score of 100% means identical. Something like the Jaro-Winkler algorithm. Alternative algorithms include Metaphone (or Double Metaphone) and SOUNDEX(); SOUNDEX() is the roughest of these.
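As a rough illustration, MySQL ships SOUNDEX() and the equivalent SOUNDS LIKE operator, which can at least generate candidate pairs; Jaro-Winkler or Double Metaphone would need a UDF or application-side code. The table and column names below are assumptions:

    -- Crude candidate matching with MySQL's built-in SOUNDEX();
    -- a.name SOUNDS LIKE b.name is shorthand for SOUNDEX(a.name) = SOUNDEX(b.name).
    SELECT a.name AS name_a, b.name AS name_b
    FROM   companies_a a
    JOIN   companies_b b
      ON   a.name SOUNDS LIKE b.name;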



An alternative solution would be to use a specialized text index. The nice thing about this approach is that you can supply a thesaurus of synonyms to smooth over non-trivial differences (INC = INCORPORATED, CO = COMPANY, etc.).

Oracle and SQL Server include such a facility, but I am not familiar enough with MySQL to say whether it has an equivalent.
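MySQL has no built-in thesaurus support, but a crude substitute is to expand the common abbreviations yourself before comparing. A sketch, with assumed table and column names, requiring MySQL 8.0+ for REGEXP_REPLACE:

    -- Poor man's thesaurus: expand trailing abbreviations before comparing.
    SELECT id,
           REGEXP_REPLACE(
             REGEXP_REPLACE(UPPER(name), ' INC\\.?$', ' INCORPORATED'),
             ' CO\\.?$', ' COMPANY') AS normalized_name
    FROM   companies_a;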

+2




SOUNDEX() will help you to a certain extent, but it is far from ideal.

SOUNDEX(string1) can equal SOUNDEX(string2) even when string1 and string2 are spelled differently. But, as I said, it is far from perfect.
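For instance, with MySQL's built-in SOUNDEX():

    -- The same code for two spellings of the same surname...
    SELECT SOUNDEX('SMITH'), SOUNDEX('SMYTHE');   -- both S530
    -- ...but also the same code for clearly different names.
    SELECT SOUNDEX('ROBERT'), SOUNDEX('RUPERT');  -- both R163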

As far as I know, there is no existing algorithm that does this perfectly.

+1




Well, there is no 100% guaranteed-correct way, no. But you can probably make some progress by transforming all the dirty columns into a more canonical form: for example, by upper-casing everything, trimming leading and trailing spaces, and collapsing runs of whitespace into a single space. Also consider things like changing names of the form "SMITH, JOHN" to "JOHN SMITH" (or vice versa; just pick one form and stick with it). And of course you should work on copies of the records; don't change the originals. You can experiment with discarding extra information (e.g. "JOHN SMITH" → "J SMITH"), but you will find that this shifts the balance between false positives and false negatives.
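A minimal sketch of that canonicalization step, assuming a working copy of the table named companies_clean and MySQL 8.0+ for REGEXP_REPLACE:

    -- Canonicalize names in the working copy; the original table is untouched.
    UPDATE companies_clean
    SET    name = REGEXP_REPLACE(TRIM(UPPER(name)), ' +', ' ');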

I would probably take the approach of assigning a similarity score to each pair of records. For example, if the canonicalized names, addresses, and email addresses match exactly, assign 1000 points; otherwise, subtract (some multiple of) the Levenshtein distance from 1000 and use that. You will need to come up with your own scoring scheme by experimenting and deciding the relative importance of different kinds of differences (for example, a differing digit in a phone number probably matters more than a one-character difference between two people's names). Then you can experimentally establish a score above which you confidently mark a pair of records as duplicates, and a lower score above which manual verification is required; below that, you can confidently say the two records are not duplicates.
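A sketch of such a pair score in SQL, assuming a LEVENSHTEIN() UDF has been installed (MySQL has no built-in Levenshtein function) and the same assumed table names as in the other answers:

    -- Score each candidate pair; the weights and thresholds here are made up
    -- and would need tuning against your own data.
    SELECT a.id AS id_a, b.id AS id_b,
           CASE WHEN a.name = b.name THEN 1000
                ELSE 1000 - 10 * LEVENSHTEIN(a.name, b.name)
           END AS score
    FROM   companies_a a
    JOIN   companies_b b
      ON   a.state = b.state        -- block on something cheap to limit pairs
    ORDER  BY score DESC;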

A realistic goal here is to reduce the amount of manual de-duplication work you will need to do. You are unlikely to be able to eliminate it completely, unless all of the duplicates were created by some automatic copying process.

0








