R data.table (<= 1.9.4) join behavior
I've come back to using R and data.table after a while, and I'm still having trouble with joins. I previously asked this question, which led to a satisfactory explanation, but I still don't really understand the logic. Let's look at a few examples:
    library("data.table")
    X <- data.table(chiave=c("a", "a", "a", "b", "b"), valore1=1:5)
    Y <- data.table(chiave=c("a", "b", "c", "d"), valore2=1:4)
    X
       chiave valore1
    1:      a       1
    2:      a       2
    3:      a       3
    4:      b       4
    5:      b       5
    Y
       chiave valore2
    1:      a       1
    2:      b       2
    3:      c       3
    4:      d       4
When I join, I get an error:
    setkey(X,chiave)
    X[Y]
    # Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), :
    #   Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
So:
    X[Y,allow.cartesian=TRUE]
       chiave valore1 valore2
    1:      a       1       1
    2:      a       2       1
    3:      a       3       1
    4:      b       4       2
    5:      b       5       2
    6:      c      NA       3
    7:      d      NA       4
Note that X has duplicate keys, but i does not. If I change Y to:
    Y <- data.table(chiave=c("b", "c", "d"), valore2=1:3)
    Y
       chiave valore2
    1:      b       1
    2:      c       2
    3:      d       3
The join is performed without an error message and allow.cartesian is not needed, but logically the situation is the same: X has duplicate keys, but i does not.
    X[Y]
       chiave valore1 valore2
    1:      b       4       1
    2:      b       5       1
    3:      c      NA       2
    4:      d      NA       3
On the other hand:
    X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"), valore1=1:8)
    Y <- data.table(chiave=c("b", "b", "d"), valore2=1:3)
    X
       chiave valore1
    1:      a       1
    2:      a       2
    3:      a       3
    4:      a       4
    5:      a       5
    6:      a       6
    7:      b       7
    8:      b       8
    Y
       chiave valore2
    1:      b       1
    2:      b       2
    3:      d       3
I have duplicate keys in both X and i, but the join (a Cartesian product) is performed without an error message, and allow.cartesian is not required:
    setkey(X,chiave)
    X[Y]
       chiave valore1 valore2
    1:      b       7       1
    2:      b       8       1
    3:      b       7       2
    4:      b       8       2
    5:      d      NA       3
From my point of view, I should be warned if and only if there are duplicate keys in both X and i (not merely when the resulting table has more rows than max(nrow(x),nrow(i))), and only in that case do I see the need for allow.cartesian (so not in my first two examples).
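The condition described above can be sketched as a small check. This is only a rough illustration: has_dup_keys_both is a hypothetical helper, not part of data.table, and it tests for repeated key values in each table separately rather than for the same key value repeated in both.

```r
library(data.table)

# Hypothetical helper (not a data.table function): TRUE only when BOTH
# tables contain at least one repeated value in the join column.
has_dup_keys_both <- function(x, i, key_col) {
  anyDuplicated(x[[key_col]]) > 0L && anyDuplicated(i[[key_col]]) > 0L
}

X  <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y  <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
Y2 <- data.table(chiave = c("b", "b", "d"), valore2 = 1:3)

first_case  <- has_dup_keys_both(X, Y, "chiave")   # Y has no repeated keys
second_case <- has_dup_keys_both(X, Y2, "chiave")  # both have repeated keys
print(c(first_case, second_case))
```

Under this check, only the second pair of tables would qualify for a warning.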
To close this question: this behavior of allow.cartesian has been fixed in the current development release, v1.9.5, and will soon be available on CRAN as v1.9.6. Odd-numbered versions are development releases and even-numbered ones are stable. From NEWS:
allow.cartesian is ignored during joins when:
- i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
- assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.
In both these cases (and during a not-join, which was already fixed in 1.9.4), allow.cartesian can be safely ignored.
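As a rough illustration of the first NEWS item, the first example from the question can be rerun on a newer release. This sketch assumes data.table 1.9.6 or later is installed; the name res is used only for illustration.

```r
library(data.table)

X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# i (here Y) has no duplicate key values, so on data.table >= 1.9.6 this
# join is expected to run without allow.cartesian = TRUE, even though the
# result has more rows than max(nrow(X), nrow(Y)).
res <- X[Y]
print(nrow(res))
```

On 1.9.4 the same X[Y] call is the one that raised the error shown in the question.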