One-to-Many Left Join / Merge in Data.Table in R
If you know the answer to this question elsewhere, please let me know, but I have searched and read a large number of data.table related articles on joining, as well as the intro and package FAQs and still haven't found the answer. Basically I'm looking for something like a flexible left joint, and I would have thought I would need to configure the keys directly so that A [B] does the trick, but it doesn't.
Essentially I have two datasets, A and B (example below). I would like to add data to in B based on their IDs and years match, resulting in something like C. There are a few things that make this different from the examples I've seen. First, its one for many, based on several keys (here id and year). Secondly, I do not need all possible combinations, that is, if information in B is not needed in B, then this is not added. Third, I would like to store B where the data is not available (so not just the intersection).
Again, I thought I need to set keys in and B as id and year, then C = A [B] will give me the answer, but I get an error that many (many, many) rows as a result, causing a Cartesian warning ... However, the result should have the exact number of lines as B, so its not clear to me what I am doing wrong.
Any advice or ideas are greatly appreciated!
A
# id year var1 var2
# 1 2010 10 20
# 1 2011 70 20
# 1 2012 90 30
# 1 2013 40 50
# 2 2010 30 30
# 2 2011 50 60
# 2 2012 80 10
# 2 2013 90 80
# 3 2010 50 50
# 3 2011 20 40
# 3 2012 90 30
# 3 2013 60 70
B
# id2 year id var3
# 1 2010 1 20
# 1 2011 1 20
# 1 2012 3 30
# 1 2013 4 50
# 2 2010 1 30
# 2 2011 1 60
# 2 2012 1 10
# 2 2013 2 80
# 3 2010 3 50
# 3 2011 2 40
# 3 2012 3 30
# 3 2013 2 70
C
# id2 year id var3 var1 var2
# 1 2010 1 20 10 20
# 1 2011 1 20 70 20
# 1 2012 3 30 90 30
# 1 2013 4 50 NA NA
# 2 2010 1 30 10 20
# 2 2011 1 60 70 20
# 2 2012 1 10 90 30
# 2 2013 2 80 90 80
# 3 2010 3 50 50 50
# 3 2011 2 40 50 60
# 3 2012 3 30 90 30
# 3 2013 2 70 90 80
source to share
You have not specified what you would like to see as the end result. You have three main options:
A[B]
B[A]
merge(A, B, all=TRUE)
Note that the error allow.cartesian
is what you will see if there are duplicate keys in the join table. In this case, you can simply set a flag TRUE
or remove duplicates instead
B[unique(A)]
# or
B[A, allow=TRUE]
setkey(A, id, year)
setkey(B, id, year)
A[B]
# id year var1 var2 id2 var3
# 1: 1 2010 10 20 1 20
# 2: 1 2010 10 20 2 30
# 3: 1 2011 70 20 1 20
# 4: 1 2011 70 20 2 60
# 5: 1 2012 90 30 2 10
# 6: 2 2011 50 60 3 40
# 7: 2 2013 90 80 2 80
# 8: 2 2013 90 80 3 70
# 9: 3 2010 50 50 3 50
# 10: 3 2012 90 30 1 30
# 11: 3 2012 90 30 3 30
# 12: 4 2013 NA NA 1 50
B[A, allow=TRUE]
# id year id2 var3 var1 var2
# 1: 1 2010 1 20 10 20
# 2: 1 2010 2 30 10 20
# 3: 1 2011 1 20 70 20
# 4: 1 2011 2 60 70 20
# 5: 1 2012 2 10 90 30
# 6: 1 2013 NA NA 40 50
# 7: 2 2010 NA NA 30 30
# 8: 2 2011 3 40 50 60
# 9: 2 2012 NA NA 80 10
# 10: 2 2013 2 80 90 80
# 11: 2 2013 3 70 90 80
# 12: 3 2010 3 50 50 50
# 13: 3 2011 NA NA 20 40
# 14: 3 2012 1 30 90 30
# 15: 3 2012 3 30 90 30
# 16: 3 2013 NA NA 60 70
merge(A, B, all=TRUE, allow=TRUE)
# id year var1 var2 id2 var3
# 1: 1 2010 10 20 1 20
# 2: 1 2010 10 20 2 30
# 3: 1 2011 70 20 1 20
# 4: 1 2011 70 20 2 60
# 5: 1 2012 90 30 2 10
# 6: 1 2013 40 50 NA NA
# 7: 2 2010 30 30 NA NA
# 8: 2 2011 50 60 3 40
# 9: 2 2012 80 10 NA NA
# 10: 2 2013 90 80 2 80
# 11: 2 2013 90 80 3 70
# 12: 3 2010 50 50 3 50
# 13: 3 2011 20 40 NA NA
# 14: 3 2012 90 30 1 30
# 15: 3 2012 90 30 3 30
# 16: 3 2013 60 70 NA NA
# 17: 4 2013 NA NA 1 50
source to share