Find all matches in a vector in the data table.

Question


This question is a continuation of this previous question.

I have a vector of IDs, sampleIDs. I also have a data.table, rec_data_table, keyed by bID and containing a column A_IDs.list, where each element is a set (vector) of aIDs.

I would like to create a second data.table keyed by sampleIDs, where for each aID there is a corresponding vector of all bIDs for which that aID appears in the column A_IDs.list.

Example:

> rec_data_table
   bid counts names_list A_IDs.list
1: 301     21        C,E       3,NA
2: 302     21          E         NA
3: 303      5      H,E,G     8,NA,7
4: 304     10        H,D        8,4
5: 305      3          E         NA
6: 306      5          G          7
7: 307      6        B,C        2,3

> sampleIDs
[1] 3 4 8

AB.dt <- data.table(aID=sampleIDs, key="aID")

# unknown step
AB.dt[ , bIDs := ????  ]

# desired result:
> AB.dt
    aid     bIDs
1:    3  301,307
2:    4      304
3:    8  303,304

I've tried several different expressions inside the call to AB.dt[]. The closest I could get was

rec_data_table[sapply(A_IDs.list, function(lst) aID %in% lst), bID]

which gives me the desired output for a single given aID; I could apply it over sampleIDs to create a list of vectors and assemble the desired output.

However, I suspect that there must be a more "suitable" data.table method to achieve this. Any suggestions are greatly appreciated.

#--------------------------------------------------#
#           SAMPLE DATA                            #

library(data.table)
set.seed(101)

  rows <- size <- 7
  varyingLengths <- c(sample(1:3, rows, TRUE))
  A <-  lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
  counts <- round(abs(rnorm(size)*12))   
rec_data_table <- data.table(bID=300+(1:size), counts=counts, names_list=A, key="bID")

A_ids.DT <- data.table(name=LETTERS[c(1:4,6:8,10:11)], id=c(1:4,6:8,10:11), key="name")
rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))]
sampleIDs <- c(3, 4, 8)

+3

r data.table

Ricardo Saporta, 18 Jan at 15:45


3 answers

I think this gives the desired output:

myfun <- function(ids) {
  any(ids %in% sampleIDs)
}

rec_data_table[sapply(A_IDs.list, myfun),]

#    bID counts names_list A_IDs.list
# 1: 301     21        C,E       3,NA
# 2: 303      5      H,E,G     8,NA,7
# 3: 304     10        H,D        8,4
# 4: 307      6        B,C        2,3

rec_data_table[sapply(A_IDs.list, myfun), list(bID, A_IDs.list)]

#   bID A_IDs.list
# 1: 301       3,NA
# 2: 303     8,NA,7
# 3: 304        8,4
# 4: 307        2,3

You can use unlist on the column A_IDs.list to get a long data.table.

# Group by bID so each unlisted ID stays paired with its own row
unique(na.omit(rec_data_table[sapply(A_IDs.list, myfun), list(aID = unlist(A_IDs.list)), by = bID]))

#    bID aID
# 1: 301   3
# 2: 303   8
# 3: 303   7
# 4: 304   8
# 5: 304   4
# 6: 307   2
# 7: 307   3

I would suggest working with "long" data rather than the nested list construct you had above, as this often results in significantly simpler code.
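
To make the "long" suggestion concrete, here is a minimal, self-contained sketch. It assumes the table printed in the question (typed in literally rather than generated from the seeded sample data) and uses a helper name, long, that is not in the original posts. It unlists A_IDs.list per bID, filters to the sampled IDs, and collapses back to one row per aID.

```r
library(data.table)

# Literal copy of the example table from the question (assumption: only the
# bID and A_IDs.list columns matter for this step)
rec_data_table <- data.table(
  bID = 301:307,
  A_IDs.list = list(c(3L, NA), NA_integer_, c(8L, NA, 7L), c(8L, 4L),
                    NA_integer_, 7L, c(2L, 3L))
)
sampleIDs <- c(3L, 4L, 8L)

# One row per (bID, aID) pair; grouping by bID keeps each aID with its own row
long <- rec_data_table[, list(aID = unlist(A_IDs.list)), by = bID]

# Keep only the sampled aIDs, then collect the matching bIDs for each aID
AB.dt <- long[aID %in% sampleIDs, list(bIDs = list(bID)), keyby = aID]
print(AB.dt)
```

Once the data is in long form, the lookup becomes an ordinary filter plus a grouped aggregation, with no sapply over a list column.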

0

Justin, 18 Jan at 16:28


# For each sample aID, collect the bIDs of the rows whose A_IDs.list contains it
bIDs <- lapply(sampleIDs, function(x)
  rec_data_table$bID[sapply(rec_data_table$A_IDs.list, function(y) x %in% y)])
AB.dt <- data.table(aID = sampleIDs, bIDs = bIDs)

Maybe there is a faster way, but this one works. :)

0

dgrigonis, 18 Jan 13 at 16:37


user1935457 · Accepted Answer · 2013-01-18T18:02:27+0000

After joining tmp to A_ids.DT in the answer to the previous question, you can get the desired result by looking up sampleIDs in tmp:

# ... from previous answer
# tmp <- A_ids.DT[tmp]

AB.dt <- setkey(tmp, id)[J(sampleIDs)][, list(bIDs = list(bID)),
                                       by = list(aid = id)]

# setkey(tmp, orig.order)
# previous answer continues ...

Note that your column heading for bID is different in these two questions. This assumes, of course, that you do not run the second-to-last line of your sample data. It should be faster than the %in%-based approaches when there are many records, thanks to the wonders of data.table's binary search.
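
Since tmp comes from the previous question's answer and is not reproduced here, the following is only a rough, self-contained sketch of the same keyed lookup. It assumes tmp is a long table with one row per (bID, id) pair; the stand-in below is built by unlisting A_IDs.list from the question's example table and is not the original tmp.

```r
library(data.table)

# Literal copy of the example table from the question
rec_data_table <- data.table(
  bID = 301:307,
  A_IDs.list = list(c(3L, NA), NA_integer_, c(8L, NA, 7L), c(8L, 4L),
                    NA_integer_, 7L, c(2L, 3L))
)
sampleIDs <- c(3L, 4L, 8L)

# Hypothetical stand-in for `tmp`: one row per (bID, id) pair
tmp <- rec_data_table[, list(id = unlist(A_IDs.list)), by = bID]

# Key on id so that J(sampleIDs) is resolved by binary search,
# then gather the matching bIDs for each looked-up id
AB.dt <- setkey(tmp, id)[J(sampleIDs)][, list(bIDs = list(bID)),
                                       by = list(aid = id)]
print(AB.dt)
```

The keyed join finds each sample ID by binary search on the sorted id column, which is where the speed advantage over scanning every list element with %in% comes from.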
