How to map people between separate systems using SQL?

I would like to know if there is a way to map people between two separate systems using (mostly) SQL.

We have two separate Oracle databases where people are stored. There is no relationship between them (i.e., I cannot join on person_id); this is intentional. I would like to create a query that checks whether a given group of people from system A exists in system B.

I can create tables if that makes it easier. I can also run queries and do some data manipulation in Excel when generating my final report. I am not very familiar with PL/SQL.

In system A, we have information about people (name, DOB, soc, sex, etc.). In system B, we have the same information about people. There might be data entry errors (someone typing the wrong spelling), but I'm not going to worry too much about that, other than maybe just comparing the first four letters. This question is more specifically about this issue.

The way to do it, I thought, is via correlated subqueries. So, roughly speaking:

select a.lastname, a.firstname, a.soc, a.dob, a.gender,
  case 
    when exists (select 1 from b where b.lastname = a.lastname) then 'Y' else 'N'
  end last_name,
  case 
    when exists (select 1 from b where b.firstname = a.firstname) then 'Y' else 'N'
  end first_name, 
  case [etc.]
from a

      

This gives me what I want, I think. I can export the results to Excel and then find records that have 3 or more matches. I believe each flag shows whether the given field from A was found in B. However, I ran this query with only three fields and it took over 3 hours (I am looking at data for 2 years). I would like to be able to match on up to 5 criteria (last name, first name, gender, date of birth, soc). Also, while the soc number is the best choice for matching, it is also the piece of data that is most often missing. What's the best way to do this? Thanks.

+2




5 answers


You definitely want to weight the different matches. If the SSN matches, that's a pretty good indicator. If only the first name matches, that is mostly useless.

You can try a scoring method based on match weights, combined with the phonetic string-matching algorithms you linked to. Here's an example I whipped up in T-SQL. It would need to be ported to Oracle for your problem.

--Score Threshold to be returned
DECLARE @Threshold DECIMAL(5,5) = 0.60

--Weights to apply to each column match (0.00 - 1.00)
DECLARE @Weight_FirstName DECIMAL(5,5) = 0.10
DECLARE @Weight_LastName DECIMAL(5,5) = 0.40
DECLARE @Weight_SSN DECIMAL(5,5) = 0.40
DECLARE @Weight_Gender DECIMAL(5,5) = 0.10

DECLARE @NewStuff TABLE (ID INT IDENTITY PRIMARY KEY, FirstName VARCHAR(MAX), LastName VARCHAR(MAX), SSN VARCHAR(11), Gender VARCHAR(1))
INSERT INTO @NewStuff
    ( FirstName, LastName, SSN, Gender )
VALUES  
    ( 'Ben','Sanders','234-62-3442','M' )

DECLARE @OldStuff TABLE (ID INT IDENTITY PRIMARY KEY, FirstName VARCHAR(MAX), LastName VARCHAR(MAX), SSN VARCHAR(11), Gender VARCHAR(1))
INSERT INTO @OldStuff
    ( FirstName, LastName, SSN, Gender )
VALUES
    ( 'Ben','Stickler','234-62-3442','M' ), --3/4 Match
    ( 'Albert','Sanders','523-42-3441','M' ), --2/4 Match
    ( 'Benne','Sanders','234-53-2334','F' ), --1/4 Match (last name only)
    ( 'Ben','Sanders','234623442','M' ), --SSN has no dashes
    ( 'Ben','Sanders','234-62-3442','M' ) --perfect match

SELECT 
    'NewID' = ns.ID,
    'OldID' = os.ID,

    'Weighted Score' = 
        (CASE WHEN ns.FirstName = os.FirstName THEN @Weight_FirstName ELSE 0 END)
        +
        (CASE WHEN ns.LastName = os.LastName THEN @Weight_LastName ELSE 0 END)
        +
        (CASE WHEN ns.SSN = os.SSN THEN @Weight_SSN ELSE 0 END)
        +
        (CASE WHEN ns.Gender = os.Gender THEN @Weight_Gender ELSE 0 END)
    ,   

    'RAW Score' = CAST(
        ((CASE WHEN ns.FirstName = os.FirstName THEN 1 ELSE 0 END)
        +
        (CASE WHEN ns.LastName = os.LastName THEN 1 ELSE 0 END) 
        +
        (CASE WHEN ns.SSN = os.SSN THEN 1 ELSE 0 END) 
        +
        (CASE WHEN ns.Gender = os.Gender THEN 1 ELSE 0 END) ) AS varchar(MAX))
        + 
        ' / 4',

    os.FirstName ,
    os.LastName ,
    os.SSN ,
    os.Gender

FROM @NewStuff ns

--make sure that at least one item matches exactly
INNER JOIN @OldStuff os ON 
    os.FirstName = ns.FirstName OR
    os.LastName = ns.LastName OR
    os.SSN = ns.SSN OR
    os.Gender = ns.Gender
where 
    (CASE WHEN ns.FirstName = os.FirstName THEN @Weight_FirstName ELSE 0 END)
    +
    (CASE WHEN ns.LastName = os.LastName THEN @Weight_LastName ELSE 0 END)
    +
    (CASE WHEN ns.SSN = os.SSN THEN @Weight_SSN ELSE 0 END)
    +
    (CASE WHEN ns.Gender = os.Gender THEN @Weight_Gender ELSE 0 END)
    >= @Threshold
ORDER BY ns.ID, 'Weighted Score' DESC

      



And here's the output:

NewID OldID Weighted  Raw    First  Last      SSN          Gender
1     5     1.00000   4 / 4  Ben    Sanders   234-62-3442  M
1     1     0.60000   3 / 4  Ben    Stickler  234-62-3442  M
1     4     0.60000   3 / 4  Ben    Sanders   234623442    M

      

Then you will need to do some post-processing to assess the validity of each possible match. If you ever get 1.00 for a weighted score, you can assume it is the correct match, unless you get two of them. If the last name and SSN both match (total weight 0.8 in my example), you can be reasonably sure it is correct.
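For Oracle, the scoring query might be ported roughly as follows. This is only a sketch: it assumes tables named a and b as in the question, each with a person_id column plus firstname, lastname, soc and gender, and it hard-codes the weights and threshold from the T-SQL example instead of using variables.

```sql
-- Sketch of an Oracle port. Table names (a, b) and the person_id column
-- are assumptions taken from the question; weights are the example values.
WITH scored AS (
  SELECT a.person_id AS a_id,
         b.person_id AS b_id,
           CASE WHEN a.firstname = b.firstname THEN 0.10 ELSE 0 END
         + CASE WHEN a.lastname  = b.lastname  THEN 0.40 ELSE 0 END
         + CASE WHEN a.soc       = b.soc       THEN 0.40 ELSE 0 END
         + CASE WHEN a.gender    = b.gender    THEN 0.10 ELSE 0 END
           AS weighted_score
    FROM a
    JOIN b
      -- only pairs sharing a last name or an SSN can reach the threshold
      ON b.lastname = a.lastname
      OR b.soc      = a.soc
)
SELECT a_id, b_id, weighted_score
  FROM scored
 WHERE weighted_score >= 0.60   -- threshold from the example
 ORDER BY a_id, weighted_score DESC;
```

With these weights a pair can only reach 0.60 if the last name or the SSN matches (first name plus gender alone gives 0.20), so the join is restricted to those two columns instead of OR-ing all four. Joining on gender alone would pair up a large share of both tables. Also note that a NULL soc never equals anything in Oracle, so a missing SSN simply contributes 0 to the score.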

+1




I would probably use joins instead of correlated subqueries, but you will need to join on all the fields, so I'm not sure how much this can improve things. Still, since correlated subqueries often have to be evaluated row by row and joins don't, it could make things better if you have good indexing. As with any performance tuning, only trying the technique will let you know for sure.

I did a similar task of finding duplicates in our SQL Server system, and I broke it down into steps. First I found everyone whose name and city/state matched exactly. Then I looked for additional possible matches (phone number, SSN, inexact name match, etc.). Whenever I found a possible match between two profiles, I added it to an intermediate table with a code for the type of match I had found. Then I assigned a confidence value to each type of match and added up the confidence for every potential match. So if the SOC matches, you might assign a high confidence, and likewise if the name, gender and DOB all match exactly; if the last name matches exactly but the first name does not, the confidence would be lower, and so on. By adding up the confidence values, I got a much better idea of which possible matches were most likely to be the same person. SQL Server also has a SOUNDEX function that can help with names that are slightly different. I bet Oracle has something like this.
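In Oracle, the stepwise approach might look something like the sketch below. The table name match_candidates, the match-type codes and the confidence numbers are all made up for illustration, and a and b (with a person_id column) are assumed from the question. LISTAGG needs Oracle 11g Release 2 or later.

```sql
-- Intermediate table: one row per (pair, type of match found)
CREATE TABLE match_candidates (
  a_id        NUMBER,
  b_id        NUMBER,
  match_type  VARCHAR2(30),
  confidence  NUMBER
);

-- Step 1: exact SSN match (illustrative confidence value)
INSERT INTO match_candidates (a_id, b_id, match_type, confidence)
SELECT a.person_id, b.person_id, 'SOC', 50
  FROM a
  JOIN b ON b.soc = a.soc;

-- Step 2: exact name plus date of birth
INSERT INTO match_candidates (a_id, b_id, match_type, confidence)
SELECT a.person_id, b.person_id, 'NAME_DOB', 40
  FROM a
  JOIN b ON b.lastname  = a.lastname
        AND b.firstname = a.firstname
        AND b.dob       = a.dob;

-- Step 3: last names that only sound alike, same date of birth
INSERT INTO match_candidates (a_id, b_id, match_type, confidence)
SELECT a.person_id, b.person_id, 'SOUNDEX_LAST_DOB', 15
  FROM a
  JOIN b ON SOUNDEX(b.lastname) = SOUNDEX(a.lastname)
        AND b.dob               = a.dob
 WHERE b.lastname <> a.lastname;

-- Add up the confidence per candidate pair
SELECT a_id,
       b_id,
       SUM(confidence) AS total_confidence,
       LISTAGG(match_type, ',') WITHIN GROUP (ORDER BY match_type) AS match_types
  FROM match_candidates
 GROUP BY a_id, b_id
 ORDER BY total_confidence DESC;
```

Each step is a plain equality join, which the optimizer can usually handle far better than one big OR condition, and the final result exports to Excel easily for review.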



After that I learned how to do fuzzy grouping in SSIS and was able to generate more matches with a higher confidence level. I don't know if Oracle's ETL tools have a way to do fuzzy logic, but if they do, it might really help with this type of task. If you also have SQL Server, SSIS can connect to Oracle, so you could use its fuzzy grouping yourself. Be aware that it can take a long time to run.

I warn you that a match on name, DOB and gender is not enough to guarantee that two records are the same person, especially for common names.

+2




An example of the JOIN approach HLGEM suggested:

SELECT a.lastname, 
       a.firstname, 
       a.soc, 
       a.dob, 
       a.gender
  FROM a
  JOIN b ON SOUNDEX(b.lastname) = SOUNDEX(a.lastname)
              AND SOUNDEX(b.firstname) = SOUNDEX(a.firstname)
              AND b.soc = a.soc
              AND b.dob = a.dob
              AND b.gender = a.gender

      

Link: SOUNDEX

+2




Are there indexes on all columns of table b in the WHERE clause? If not, this will result in a full table scan for every row of table a.
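If they are missing, something like the following might help. The index names are made up, and b is the table name from the question; which columns are worth indexing depends on which ones your query actually compares.

```sql
-- Hypothetical index names; b is the lookup table from the question
CREATE INDEX b_lastname_ix  ON b (lastname);
CREATE INDEX b_firstname_ix ON b (firstname);
CREATE INDEX b_soc_ix       ON b (soc);
CREATE INDEX b_dob_ix       ON b (dob);

-- If you compare SOUNDEX(lastname), a function-based index on the
-- same expression lets that comparison use an index as well
CREATE INDEX b_lastname_sdx_ix ON b (SOUNDEX(lastname));
```

A separate index on gender is unlikely to help much, since the column has so few distinct values.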

0




You can use SOUNDEX, but you can also use utl_match for fuzzy string comparison. utl_match lets you define a threshold: http://www.psoug.org/reference/utl_match.html
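A minimal sketch of what that could look like, assuming the tables a and b from the question with a person_id column. The cutoff of 90 is an arbitrary value you would need to tune. UTL_MATCH.JARO_WINKLER_SIMILARITY returns an integer from 0 (no similarity) to 100 (identical).

```sql
-- Compare last names only between people who already share
-- date of birth and gender, to keep the number of pairs manageable
SELECT *
  FROM (SELECT a.person_id AS a_id,
               b.person_id AS b_id,
               a.lastname  AS a_lastname,
               b.lastname  AS b_lastname,
               UTL_MATCH.JARO_WINKLER_SIMILARITY(
                 UPPER(a.lastname), UPPER(b.lastname)) AS lastname_sim
          FROM a
          JOIN b
            ON b.dob    = a.dob
           AND b.gender = a.gender)
 WHERE lastname_sim >= 90   -- arbitrary threshold, tune for your data
 ORDER BY lastname_sim DESC;
```

The package also has UTL_MATCH.EDIT_DISTANCE_SIMILARITY, which works the same way but is based on edit distance instead of Jaro-Winkler.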

0

