"fuzzy key mapping" for combining data.

Question

"fuzzy key mapping" for combining data.

I am trying to match workers from year to year using name strings and specific experiences. Experience can only increase for no more than one year from year to year, so I would like to use this to help reconcile with other metrics.

For example:

dt1<-data.table(name=c("jane doe","jane doe",
                       "john doe","jane smith"),
                exp=c(0.,5,1,2),id=1:4,key="name")
dt2<-data.table(name=c("jane doe","jane doe",
                       "john doe","jane smith"),
                exp=c(0,30,1.5,2),key="name")

I want to match the first "jane doe" in dt1

with the first "jane doe" in dt2

. The last "jane doe" are not the same because they are clearly different people (based on completely different levels of experience).

I would also like to add some flags so that I know that I compared these people in this way later down the line. Here's my first pass:

dt2[dt1,`:=`(id=ifelse(exp<=i.exp+1,i.id,NA),
             flag=ifelse(exp<=i.exp+1,i.id,NA))]

But it doesn't work - this is what it gives me:

> dt2
         name  exp id flag
1:   jane doe  0.0  2    2
2:   jane doe 30.0 NA   NA
3: jane smith  2.0  4    4
4:   john doe  1.5  3    3

Seems to have correctly missed the match for the last "jane doe", but it looks like it matched the first "jane doe" with the wrong predecessor "jane doe". I do not quite understand why this is so; at any rate, it is preferable to have a way to enable the collation on exp

earlier than after the concatenation - this will also clear up the mess ifelse

when defining new variables. Any suggestions?

For clarity, here's the desired result:

> dt2
         name  exp id flag
1:   jane doe  1.0  1    1
2:   jane doe 30.0 NA   NA
3: jane smith  2.0  4    1
4:   john doe  1.5  3    1

+3

r data.table

MichaelChirico Apr 26. At 4:19 am

source to share

1 answer

David Arenburg · Accepted Answer · 2015-04-26T21:00:18+0000

In your case, the connection is not really "fuzzy". All you are trying to do is join name

on exp

, allowing one year distance per match. This is useful for sliding joins with a BOM -1L

.

First, we will enter the datasets correctly

setkey(dt1, name, exp) 
setkey(dt2, name, exp)

Then we will perform a sliding join, passing -1L

as its value

dt2[dt1, `:=`(id = i.id, flag = 1L), roll = -1L]
df2
#          name  exp id flag
# 1:   jane doe  0.0  1    1
# 2:   jane doe 30.0 NA   NA
# 3: jane smith  2.0  4    1
# 4:   john doe  1.5  3    1

In the future, if you need to draw an interval, for example c(1L, -1L)

, you can look here for some function examples foverlaps

.

"fuzzy key mapping" for combining data.

More articles: