Strange behavior of subset with multiple keys using data.table
I have set multiple keys on a data.table, but when I try to select rows by multiple key values, the result seems to contain a row for each potential combination, with NA filled in for rows that don't exist.
I can reproduce the sample code in 1c of this doc, so it must be something I just can't see. Any help would be much appreciated.
library(data.table)
dt = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)
setkey(dt,colA,colB)
print(dt)
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
# 3: 3 B 13
# 4: 4 B 14
print(
dt[.(2,"A")]
)
# As expected
# colA colB colC
# 1: 2 A 12
print(
dt[.(c(2,3),"A")]
)
# colA colB colC
# 1: 2 A 12
# 2: 3 A NA #Unexpected
print(
dt[.(unique(colA),"A")]
)
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
# 3: 3 A NA #Unexpected
# 4: 4 A NA #Unexpected
DT[i] looks up each row of i in the rows of DT. By default, rows of i that have no match are returned with NA in the remaining columns. To drop the unmatched rows instead, use nomatch = 0:
dt[.(unique(colA),"A"), nomatch=0]
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
The nomatch argument is covered in the vignette the OP linked. To find the latest version of the vignette, use browseVignettes("data.table").
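To make the lookup behavior above concrete, here is a minimal sketch that builds the lookup table i explicitly instead of using .(). The names i, res_na and res_drop are illustrative only and not part of the original question. It assumes the same dt as in the question and relies on the length-1 "A" being recycled to match the longer column.

```r
library(data.table)

dt <- data.table(colA = 1:4,
                 colB = c("A", "A", "B", "B"),
                 colC = 11:14)
setkey(dt, colA, colB)

# The lookup table: "A" is recycled, so every colA value is paired with "A".
# This is what .(c(2,3), "A") amounts to inside dt[...].
i <- data.table(colA = c(2L, 3L), colB = "A")
print(i)

# Default behavior: one result row per row of i, with NA where dt has no match.
res_na <- dt[i]
print(res_na)

# nomatch = 0: rows of i without a match in dt are dropped.
res_drop <- dt[i, nomatch = 0]
print(res_drop)
```

Seen this way, the row (3, "A") in the question's output is not a row of dt at all. It is a row of i that found no match, which is why colC is NA.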
As a side note, there is no need to set keys before joining. You can use on= instead:
library(data.table)
dt2 = data.table(colA = 1:4,
                 colB = c("A","A","B","B"),
                 colC = 11:14)
dt2[.(unique(colA),"A"), on=.(colA, colB), nomatch=0]
# colA colB colC
# 1: 1 A 11
# 2: 2 A 12
See Arun's answer for details on why setting keys is usually not needed to improve the performance of joins. It says:
Usually, unless there are repeated grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
I usually only set keys when I am doing joins interactively, so I can skip typing on=.
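If the goal is simply "rows where colA is one of several values and colB is "A"", an ordinary logical subset may also do, as an alternative to a join. This is a sketch under that assumption about the intent, using the same data as above. The name res_filter is illustrative only.

```r
library(data.table)

dt2 <- data.table(colA = 1:4,
                  colB = c("A", "A", "B", "B"),
                  colC = 11:14)

# A logical filter only ever returns rows that exist in dt2,
# so there are no NA-filled rows to drop and no key or on= is needed.
res_filter <- dt2[colA %in% c(2, 3) & colB == "A"]
print(res_filter)
```

A filter like this never produces placeholder rows, because it tests existing rows rather than looking up a table of requested combinations.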