Find all matches of a vector in a data.table
This question is a continuation of this previous question.
I have a vector of IDs, sampleIDs. I also have a data.table, rec_data_table, keyed by bID and containing a column A_IDs.list, where each item is a set (vector) of aIDs.
I would like to create a second data.table containing the sampleIDs as aID, where for each aID there is a corresponding vector of all bIDs for which that aID appears in the column A_IDs.list.
Example:
> rec_data_table
bID counts names_list A_IDs.list
1: 301 21 C,E 3,NA
2: 302 21 E NA
3: 303 5 H,E,G 8,NA,7
4: 304 10 H,D 8,4
5: 305 3 E NA
6: 306 5 G 7
7: 307 6 B,C 2,3
> sampleIDs
[1] 3 4 8
AB.dt <- data.table(aID=sampleIDs, key="aID")
# unknown step
AB.dt[ , bIDs := ???? ]
# desired result:
> AB.dt
aID bIDs
1: 3 301,307
2: 4 304
3: 8 303,304
I've tried several different expressions inside the AB.dt[] call. The closest I could get was
rec_data_table[sapply(A_IDs.list, function(lst) aID %in% lst), bID]
which gives me the desired output for a single given aID. I can then loop over sampleIDs to create a list of vectors and build the desired output.
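For reference, that loop might look like the sketch below. It is only an illustration: the table is rebuilt by hand from the printed example (leaving out the counts and names_list columns) rather than from the seeded sample data, and the helper name bIDs_per_aID is made up here.

```r
library(data.table)

# Example table reconstructed by hand from the printed output in the question
rec_data_table <- data.table(
  bID = 301:307,
  A_IDs.list = list(c(3, NA), NA_real_, c(8, NA, 7), c(8, 4), NA_real_, 7, c(2, 3)),
  key = "bID"
)
sampleIDs <- c(3, 4, 8)

# For each aID, keep the bIDs of the rows whose A_IDs.list contains it
# (bIDs_per_aID is a hypothetical name used only in this sketch)
bIDs_per_aID <- lapply(sampleIDs, function(a) {
  rec_data_table[sapply(A_IDs.list, function(lst) a %in% lst), bID]
})

# A list of length(sampleIDs) becomes a list column
AB.dt <- data.table(aID = sampleIDs, bIDs = bIDs_per_aID, key = "aID")
print(AB.dt)
```

This scans every row of rec_data_table once per sample ID, which is why it can get slow on large tables.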
However, I suspect that there must be a more "suitable" data.table method to achieve this. Any suggestions are greatly appreciated.
#--------------------------------------------------#
# SAMPLE DATA #
library(data.table)
set.seed(101)
rows <- size <- 7
varyingLengths <- c(sample(1:3, rows, TRUE))
A <- lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
counts <- round(abs(rnorm(size)*12))
rec_data_table <- data.table(bID=300+(1:size), counts=counts, names_list=A, key="bID")
A_ids.DT <- data.table(name=LETTERS[c(1:4,6:8,10:11)], id=c(1:4,6:8,10:11), key="name")
# look up the id for each name; lapply always returns a list, and j = id already yields a plain vector
rec_data_table[, A_IDs.list := lapply(names_list, function(n) A_ids.DT[n, id])]
sampleIDs <- c(3, 4, 8)
After joining tmp to A_ids.DT as in the answer to the previous question, you can get the desired result by looking up sampleIDs in tmp:
# ... from previous answer
# tmp <- A_ids.DT[tmp]
AB.dt <- setkey(tmp, id)[J(sampleIDs)][, list(bIDs = list(bID)),
by = list(aid = id)]
# setkey(tmp, orig.order)
# previous answer continues ...
Note that your column heading bID is different in these two questions. This assumes, of course, that you are not running the second-to-last line of your sample data. This should be faster than the %in%-based approaches when there are many records, thanks to the binary search that data.table performs on keyed columns.
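A self-contained sketch of the keyed lookup follows. The table tmp here is a hand-made stand-in for the one built in the previous answer (one row per bID/id pair, NAs already dropped), so its contents are an assumption and not the actual intermediate result.

```r
library(data.table)

# Hypothetical stand-in for `tmp`: one row per (bID, id) pair
tmp <- data.table(
  bID = c(301, 303, 303, 304, 304, 306, 307, 307),
  id  = c(3,   8,   7,   8,   4,   7,   2,   3)
)
sampleIDs <- c(3, 4, 8)

# Keying by id sorts the table so that the join below can use binary search
setkey(tmp, id)

# Join on the key, then collapse the matching bIDs into one list element per id
AB.dt <- tmp[J(sampleIDs)][, list(bIDs = list(bID)), by = list(aID = id)]
print(AB.dt)
```

The join touches only the rows that match each sample ID, instead of testing every list element with %in%.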
I think this gives the desired output:
myfun <- function(ids) {
any(ids %in% sampleIDs)
}
rec_data_table[sapply(A_IDs.list, myfun),]
# bID counts names_list A_IDs.list
# 1: 301 21 C,E 3,NA
# 2: 303 5 H,E,G 8,NA,7
# 3: 304 10 H,D 8,4
# 4: 307 6 B,C 2,3
rec_data_table[sapply(A_IDs.list, myfun), list(bID, A_IDs.list)]
# bID A_IDs.list
# 1: 301 3,NA
# 2: 303 8,NA,7
# 3: 304 8,4
# 4: 307 2,3
You can use unlist on the column A_IDs.list to get a long data.table.
# unlist within each bID so that every ID stays paired with its own row
unique(na.omit(rec_data_table[sapply(A_IDs.list, myfun), list(aID = unlist(A_IDs.list)), by = bID]))
# bID aID
# 1: 301 3
# 2: 303 8
# 3: 303 7
# 4: 304 8
# 5: 304 4
# 6: 307 2
I would suggest working with "long" data rather than the nested list construct you had above, as this often results in significantly simpler code.
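As a rough illustration of the long-format route, the sketch below unnests the list column once and then answers the original question with an ordinary filter and group-by. The table is again rebuilt by hand from the printed example, and the names long and aID are choices made for this sketch.

```r
library(data.table)

# Example table reconstructed by hand from the printed output in the question
rec_data_table <- data.table(
  bID = 301:307,
  A_IDs.list = list(c(3, NA), NA_real_, c(8, NA, 7), c(8, 4), NA_real_, 7, c(2, 3)),
  key = "bID"
)
sampleIDs <- c(3, 4, 8)

# One row per (bID, aID) pair; grouping by bID keeps each ID with its own row
long <- rec_data_table[, list(aID = unlist(A_IDs.list)), by = bID]
long <- na.omit(long)

# Keep only the sampled IDs and collect the bIDs for each; keyby sorts by aID
AB.dt <- long[aID %in% sampleIDs, list(bIDs = list(bID)), keyby = aID]
print(AB.dt)
```

Once the data is long, the nested sapply over list elements is no longer needed.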