Strange behavior of subset with multiple keys using data.table

Question

I have set multiple keys on a data.table, but when I try to select rows by multiple key values, it seems to return a row for each potential combination, populated with NA for the combinations that don't exist.

I can get the sample code in this doc to work, so it must be something I just can't see. Any help would be much appreciated.

library(data.table)

dt = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)

setkey(dt,colA,colB)

print(dt)
#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12
# 3:    3    B   13
# 4:    4    B   14

print(
  dt[.(2,"A")]
)
# As expected
#    colA colB colC
# 1:    2    A   12

print(
  dt[.(c(2,3),"A")]
)
#    colA colB colC
# 1:    2    A   12
# 2:    3    A   NA #Unexpected

print(
  dt[.(unique(colA),"A")]
)
#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12
# 3:    3    A   NA #Unexpected
# 4:    4    A   NA #Unexpected

r data.table

Matt, June 15, 2017 at 23:40

1 answer

Frank · Accepted Answer · 2017-06-15T23:57:22+0000

DT[i] looks up each row of i in the rows of DT. By default, rows of i that have no match are returned with NA in the columns that come from DT. To drop the unmatched rows instead, use nomatch = 0:

dt[.(unique(colA),"A"), nomatch=0]

#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12
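
To see where the NA rows come from, it can help to build i explicitly. The following is a minimal sketch, assuming the same dt as in the question; the names i, with_na and matched are only illustrative. Inside [, .() is an alias for list(), and the single "A" is recycled against the longer vector, so two (colA, colB) pairs are looked up, one of which does not exist in dt.

```r
library(data.table)

dt <- data.table(colA = 1:4,
                 colB = c("A", "A", "B", "B"),
                 colC = 11:14)
setkey(dt, colA, colB)

# Explicit lookup table standing in for .(c(2, 3), "A"):
# "A" is recycled, giving the pairs (2, "A") and (3, "A").
i <- data.table(colA = c(2L, 3L), colB = "A")

# Default behaviour: every row of i is kept; (3, "A") has no match,
# so its colC is NA.
with_na <- dt[i]

# nomatch = 0 drops the rows of i that found no match in dt.
matched <- dt[i, nomatch = 0]

print(with_na)
print(matched)
```

The result of DT[i] has one row per row of i (more if a row of i matches several rows of DT), which is why asking for combinations that are not in the table yields NA rows rather than fewer rows.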

The nomatch argument is covered in the vignette the OP linked. To find the latest version of the vignette, use browseVignettes("data.table").

As a side note, there is no need to set keys before joining. You can use on= instead:

library(data.table)
dt2 = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)

dt2[.(unique(colA),"A"), on=.(colA, colB), nomatch=0]

#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12

See Arun's answer for details on why keys are usually not needed to improve performance in joins. It says:

Generally, unless there are repeated grouping / joining operations performed on the same keyed data.table, there should be no discernible difference.

I usually only set keys when I'm doing joins interactively, so I can skip typing on=.
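
As a rough illustration of the two styles side by side, here is a small sketch using the same data as above; the names keyed, unkeyed, res_keyed and res_on are made up for this example. It assumes only that key() reports the current key of a table and returns NULL when none is set.

```r
library(data.table)

keyed <- data.table(colA = 1:4,
                    colB = c("A", "A", "B", "B"),
                    colC = 11:14)
unkeyed <- copy(keyed)      # independent copy that stays unkeyed
setkey(keyed, colA, colB)   # sorts keyed and marks colA, colB as its key

# Keyed join: the columns of i are matched to the key columns by position.
res_keyed <- keyed[.(1:4, "A"), nomatch = 0]

# Ad hoc join: the join columns are named in on=, no key required.
res_on <- unkeyed[.(1:4, "A"), on = .(colA, colB), nomatch = 0]

print(key(keyed))    # the key columns set above
print(key(unkeyed))  # NULL, since no key was set
print(res_keyed)
print(res_on)
```

Both calls should return the same matched rows; the difference is that on= states the join columns at the call site, while the keyed form relies on state set earlier by setkey.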
