"fuzzy key mapping" for combining data.

I am trying to match workers from year to year using name strings and specific experiences. Experience can only increase for no more than one year from year to year, so I would like to use this to help reconcile with other metrics.

For example:

dt1<-data.table(name=c("jane doe","jane doe",
                       "john doe","jane smith"),
                exp=c(0.,5,1,2),id=1:4,key="name")
dt2<-data.table(name=c("jane doe","jane doe",
                       "john doe","jane smith"),
                exp=c(0,30,1.5,2),key="name")

      

I want to match the first "jane doe" in dt1

with the first "jane doe" in dt2

. The last "jane doe" are not the same because they are clearly different people (based on completely different levels of experience).

I would also like to add some flags so that I know that I compared these people in this way later down the line. Here's my first pass:

dt2[dt1,`:=`(id=ifelse(exp<=i.exp+1,i.id,NA),
             flag=ifelse(exp<=i.exp+1,i.id,NA))]

      

But it doesn't work - this is what it gives me:

> dt2
         name  exp id flag
1:   jane doe  0.0  2    2
2:   jane doe 30.0 NA   NA
3: jane smith  2.0  4    4
4:   john doe  1.5  3    3

      

Seems to have correctly missed the match for the last "jane doe", but it looks like it matched the first "jane doe" with the wrong predecessor "jane doe". I do not quite understand why this is so; at any rate, it is preferable to have a way to enable the collation on exp

earlier than after the concatenation - this will also clear up the mess ifelse

when defining new variables. Any suggestions?


For clarity, here's the desired result:

> dt2
         name  exp id flag
1:   jane doe  1.0  1    1
2:   jane doe 30.0 NA   NA
3: jane smith  2.0  4    1
4:   john doe  1.5  3    1

      

+3


source to share


1 answer


In your case, the connection is not really "fuzzy". All you are trying to do is join name

on exp

, allowing one year distance per match. This is useful for sliding joins with a BOM -1L

.

First, we will enter the datasets correctly

setkey(dt1, name, exp) 
setkey(dt2, name, exp) 

      

Then we will perform a sliding join, passing -1L

as its value



dt2[dt1, `:=`(id = i.id, flag = 1L), roll = -1L]
df2
#          name  exp id flag
# 1:   jane doe  0.0  1    1
# 2:   jane doe 30.0 NA   NA
# 3: jane smith  2.0  4    1
# 4:   john doe  1.5  3    1

      


In the future, if you need to draw an interval, for example c(1L, -1L)

, you can look here for some function examples foverlaps

.

+5


source







All Articles