One-to-Many Left Join / Merge in Data.Table in R

If you know the answer to this question elsewhere, please let me know, but I have searched and read a large number of data.table related articles on joining, as well as the intro and package FAQs and still haven't found the answer. Basically I'm looking for something like a flexible left joint, and I would have thought I would need to configure the keys directly so that A [B] does the trick, but it doesn't.

Essentially I have two datasets, A and B (example below). I would like to add data to in B based on their IDs and years match, resulting in something like C. There are a few things that make this different from the examples I've seen. First, its one for many, based on several keys (here id and year). Secondly, I do not need all possible combinations, that is, if information in B is not needed in B, then this is not added. Third, I would like to store B where the data is not available (so not just the intersection).

Again, I thought I need to set keys in and B as id and year, then C = A [B] will give me the answer, but I get an error that many (many, many) rows as a result, causing a Cartesian warning ... However, the result should have the exact number of lines as B, so its not clear to me what I am doing wrong.

Any advice or ideas are greatly appreciated!

A
#    id   year   var1    var2
#    1    2010    10     20
#    1    2011    70     20
#    1    2012    90     30
#    1    2013    40     50 
#    2    2010    30     30
#    2    2011    50     60
#    2    2012    80     10
#    2    2013    90     80 
#    3    2010    50     50
#    3    2011    20     40
#    3    2012    90     30
#    3    2013    60     70 


B
#    id2  year    id    var3
#    1    2010    1     20
#    1    2011    1     20
#    1    2012    3     30
#    1    2013    4     50 
#    2    2010    1     30
#    2    2011    1     60
#    2    2012    1     10
#    2    2013    2     80 
#    3    2010    3     50
#    3    2011    2     40
#    3    2012    3     30
#    3    2013    2     70 

C
#    id2  year    id    var3   var1   var2
#    1    2010    1     20      10     20
#    1    2011    1     20      70     20
#    1    2012    3     30      90     30
#    1    2013    4     50      NA     NA 
#    2    2010    1     30      10     20
#    2    2011    1     60      70     20
#    2    2012    1     10      90     30
#    2    2013    2     80      90     80 
#    3    2010    3     50      50     50
#    3    2011    2     40      50     60
#    3    2012    3     30      90     30
#    3    2013    2     70      90     80  

      

+3


source to share


1 answer


You have not specified what you would like to see as the end result. You have three main options:

 A[B]
 B[A]
 merge(A, B, all=TRUE)

      

Note that the error allow.cartesian

is what you will see if there are duplicate keys in the join table. In this case, you can simply set a flag TRUE

or remove duplicates instead



 B[unique(A)]
 # or
 B[A, allow=TRUE]

      


setkey(A, id, year)
setkey(B, id, year)


A[B]
#     id year var1 var2 id2 var3
#  1:  1 2010   10   20   1   20
#  2:  1 2010   10   20   2   30
#  3:  1 2011   70   20   1   20
#  4:  1 2011   70   20   2   60
#  5:  1 2012   90   30   2   10
#  6:  2 2011   50   60   3   40
#  7:  2 2013   90   80   2   80
#  8:  2 2013   90   80   3   70
#  9:  3 2010   50   50   3   50
# 10:  3 2012   90   30   1   30
# 11:  3 2012   90   30   3   30
# 12:  4 2013   NA   NA   1   50

B[A, allow=TRUE]
#     id year id2 var3 var1 var2
#  1:  1 2010   1   20   10   20
#  2:  1 2010   2   30   10   20
#  3:  1 2011   1   20   70   20
#  4:  1 2011   2   60   70   20
#  5:  1 2012   2   10   90   30
#  6:  1 2013  NA   NA   40   50
#  7:  2 2010  NA   NA   30   30
#  8:  2 2011   3   40   50   60
#  9:  2 2012  NA   NA   80   10
# 10:  2 2013   2   80   90   80
# 11:  2 2013   3   70   90   80
# 12:  3 2010   3   50   50   50
# 13:  3 2011  NA   NA   20   40
# 14:  3 2012   1   30   90   30
# 15:  3 2012   3   30   90   30
# 16:  3 2013  NA   NA   60   70

merge(A, B, all=TRUE, allow=TRUE)
#     id year var1 var2 id2 var3
#  1:  1 2010   10   20   1   20
#  2:  1 2010   10   20   2   30
#  3:  1 2011   70   20   1   20
#  4:  1 2011   70   20   2   60
#  5:  1 2012   90   30   2   10
#  6:  1 2013   40   50  NA   NA
#  7:  2 2010   30   30  NA   NA
#  8:  2 2011   50   60   3   40
#  9:  2 2012   80   10  NA   NA
# 10:  2 2013   90   80   2   80
# 11:  2 2013   90   80   3   70
# 12:  3 2010   50   50   3   50
# 13:  3 2011   20   40  NA   NA
# 14:  3 2012   90   30   1   30
# 15:  3 2012   90   30   3   30
# 16:  3 2013   60   70  NA   NA
# 17:  4 2013   NA   NA   1   50

      

+4


source







All Articles