"fuzzy key mapping" for combining data.
I am trying to match workers from year to year using name strings and specific experiences. Experience can only increase for no more than one year from year to year, so I would like to use this to help reconcile with other metrics.
For example:
dt1<-data.table(name=c("jane doe","jane doe",
"john doe","jane smith"),
exp=c(0.,5,1,2),id=1:4,key="name")
dt2<-data.table(name=c("jane doe","jane doe",
"john doe","jane smith"),
exp=c(0,30,1.5,2),key="name")
I want to match the first "jane doe" in dt1
with the first "jane doe" in dt2
. The last "jane doe" are not the same because they are clearly different people (based on completely different levels of experience).
I would also like to add some flags so that I know that I compared these people in this way later down the line. Here's my first pass:
dt2[dt1,`:=`(id=ifelse(exp<=i.exp+1,i.id,NA),
flag=ifelse(exp<=i.exp+1,i.id,NA))]
But it doesn't work - this is what it gives me:
> dt2
name exp id flag
1: jane doe 0.0 2 2
2: jane doe 30.0 NA NA
3: jane smith 2.0 4 4
4: john doe 1.5 3 3
Seems to have correctly missed the match for the last "jane doe", but it looks like it matched the first "jane doe" with the wrong predecessor "jane doe". I do not quite understand why this is so; at any rate, it is preferable to have a way to enable the collation on exp
earlier than after the concatenation - this will also clear up the mess ifelse
when defining new variables. Any suggestions?
For clarity, here's the desired result:
> dt2
name exp id flag
1: jane doe 1.0 1 1
2: jane doe 30.0 NA NA
3: jane smith 2.0 4 1
4: john doe 1.5 3 1
source to share
In your case, the connection is not really "fuzzy". All you are trying to do is join name
on exp
, allowing one year distance per match. This is useful for sliding joins with a BOM -1L
.
First, we will enter the datasets correctly
setkey(dt1, name, exp)
setkey(dt2, name, exp)
Then we will perform a sliding join, passing -1L
as its value
dt2[dt1, `:=`(id = i.id, flag = 1L), roll = -1L]
df2
# name exp id flag
# 1: jane doe 0.0 1 1
# 2: jane doe 30.0 NA NA
# 3: jane smith 2.0 4 1
# 4: john doe 1.5 3 1
In the future, if you need to draw an interval, for example c(1L, -1L)
, you can look here for some function examples foverlaps
.
source to share