R data.table (<= 1.9.4) join behavior

I'm coming back to R and data.table after a while, and I still have a problem with joins. I previously asked this question, which led to a satisfactory explanation, but I still don't understand the logic. Let's look at a few examples:

library("data.table")
X <- data.table(chiave=c("a", "a", "a", "b", "b"),valore1=1:5)
Y <- data.table(chiave=c("a", "b", "c", "d"),valore2=1:4)
X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      b       4
5:      b       5
 Y
   chiave valore2
1:      a       1
2:      b       2
3:      c       3
4:      d       4

      

When I join, I get an error:

 setkey(X,chiave)
 X[Y]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

      

So:

 X[Y, allow.cartesian=TRUE]
   chiave valore1 valore2
1:      a       1       1
2:      a       2       1
3:      a       3       1
4:      b       4       2
5:      b       5       2
6:      c      NA       3
7:      d      NA       4

      

Note that X has duplicate keys, but i (here Y) does not. If I change Y to:

 Y <- data.table(chiave=c("b", "c", "d"),valore2=1:3)
 Y
   chiave valore2
1:      b       1
2:      c       2
3:      d       3

      

The join is performed without an error message and there is no need for allow.cartesian, but logically the situation is the same: X has duplicate keys, but i does not.

 X[Y]
   chiave valore1 valore2
1:      b       4       1
2:      b       5       1
3:      c      NA       2
4:      d      NA       3

      

On the other hand:

 X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"),valore1=1:8)
 Y <- data.table(chiave=c("b", "b", "d"),valore2=1:3)
 X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      a       4
5:      a       5
6:      a       6
7:      b       7
8:      b       8
 Y
   chiave valore2
1:      b       1
2:      b       2
3:      d       3

      

Here I have duplicate keys both in X and in i, but the join (a Cartesian product) is performed without an error message and allow.cartesian is not required:

 setkey(X,chiave)
 X[Y]
   chiave valore1 valore2
1:      b       7       1
2:      b       8       1
3:      b       7       2
4:      b       8       2
5:      d      NA       3

      

From my point of view, I should be warned if and only if there are duplicate keys in both X and i (not merely when the resulting table has more rows than max(nrow(x),nrow(i))), and only in that case do I see the need for allow.cartesian (so not in my first two examples).
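To make the difference between the two rules concrete, here is a minimal base-R sketch that applies both to the key columns of the three examples above. The helper names (join_rows, size_check_fails, both_have_duplicates) are hypothetical, and the size check is modeled only on the wording of the 1.9.4 error message, not on data.table's actual source.

```r
# Rows an X[Y]-style join produces: each row of i contributes one row per
# matching key in x, or a single NA row when nothing in x matches.
join_rows <- function(x_keys, i_keys) {
  matches <- vapply(i_keys, function(k) sum(x_keys == k), integer(1))
  sum(pmax(matches, 1L))
}

# Size-based rule, as worded in the 1.9.4 error message (an assumption
# about the behavior, not the library's real code).
size_check_fails <- function(x_keys, i_keys) {
  join_rows(x_keys, i_keys) > max(length(x_keys), length(i_keys))
}

# Rule proposed in the question: duplicate keys on both sides.
both_have_duplicates <- function(x_keys, i_keys) {
  anyDuplicated(x_keys) > 0L && anyDuplicated(i_keys) > 0L
}

# Key columns of the three examples from the question.
examples <- list(
  first  = list(x = c("a", "a", "a", "b", "b"), i = c("a", "b", "c", "d")),
  second = list(x = c("a", "a", "a", "b", "b"), i = c("b", "c", "d")),
  third  = list(x = c(rep("a", 6), "b", "b"),   i = c("b", "b", "d"))
)

for (name in names(examples)) {
  e <- examples[[name]]
  cat(name, ": rows =", join_rows(e$x, e$i),
      "| size rule triggers =", size_check_fails(e$x, e$i),
      "| duplicates on both sides =", both_have_duplicates(e$x, e$i), "\n")
}
```

Under these assumptions the size rule fires only for the first example (7 rows against a limit of 5), while the proposed rule would fire only for the third, which is exactly the mismatch described above.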



1 answer


To answer this question: this behavior with allow.cartesian has been fixed in the current development release, v1.9.5, and will soon be available on CRAN as v1.9.6. Odd-numbered versions are development releases and even-numbered versions are stable. From NEWS:



  1. allow.cartesian is ignored during joins when:

    • i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
    • assigning by reference, i.e. j has :=. Closes #800. Thanks to @matthieugomez for the report.

    In both these cases (and during a not-join, which was already fixed in 1.9.4), allow.cartesian can safely be ignored.
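As a quick illustration of the two cases in the NEWS entry, the following sketch reruns the first example from the question. It assumes data.table 1.9.6 or later is installed; on 1.9.4 the first join would still raise the error shown in the question.

```r
library(data.table)  # assumption: version >= 1.9.6

X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# Case 1: i (Y) has no duplicate keys and mult defaults to "all", so the
# join runs without allow.cartesian even though the result has more rows
# than max(nrow(X), nrow(Y)).
res <- X[Y]
nrow(res)

# Case 2: assignment by reference in j. X is updated in place, so the join
# cannot add rows to X and allow.cartesian is not needed either.
X[Y, valore2 := i.valore2]
X
```

In the second call, the i. prefix refers to the column coming from Y, so each row of X picks up the valore2 value of its matching key.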
