Sibling network using data.table

I would like to create a sibship network using data.table

.

My data looks like

indata <-
structure(list(id = c(1L, 2L, 3L, 4L, 12L, 13L, 14L, 15L), fid = c(NA, 
9L, 1L, 1L, 7L, 5L, 5L, 5L), mid = c(0L, NA, 2L, 2L, 6L, 6L, 
6L, 8L)), .Names = c("id", "fid", "mid"), class = "data.frame", row.names = 
c(NA, -8L))

      

which the

  id fid mid
1  1  NA   0
2  2   9  NA
3  3   1   2
4  4   1   2
5 12   7   6
6 13   5   6
7 14   5   6
8 15   5   8

      

The three columns represent the id, id of the father, and id of the mother, respectively. 0

or NA

means it is not available. So in the above data, person 3 and 4 are full brothers and sisters (they both have a father 1

and mother 2

), and 12 and 13 are half a brother (they have different fathers, but the same mother, 6

).

For each line in the dataframe, I need a list of siblings (let's just consider half of the siblings). My ideal end result would be

  id fid mid sibs
1  1  NA   0 NA
2  2   9  NA NA
3  3   1   2 4
4  4   1   2 3
5 12   7   6 13, 14
6 13   5   6 12, 14, 15
7 14   5   6 12, 13, 15
8 15   5   8 13, 14

      

where the last column sibs

is a list or a vector (and it doesn't have to be part of the dataset).

A rough version for getting the result using R base is given below

# get a list of offspring for each father id
foffspring <- by(indata, indata$fid, function(i) { i$id }, simplify=FALSE) 
# and mother id
moffspring <- by(indata, indata$mid, function(i) { i$id }, simplify=FALSE)

      

To get siblings to go through each id. Find their father and mother and combine the two matching entries from the previous lists.

sibs <- sapply( 1:nrow(indata), function(i) {
    res <- c()
    if( !is.na(indata$fid[i]) ) 
        res <- c(res, unlist(foffspring[paste0(indata$fid[i])]))
    if( !is.na(indata$mid[i]) ) 
        res <- c(res, unlist(moffspring[paste0(indata$mid[i])]))
    unique(res[res != indata$id[i]]) 
    }, simplify=TRUE )

      

This creates

> sibs
[[1]]
integer(0)

[[2]]
integer(0)

[[3]]
[1] 4

[[4]]
[1] 3

[[5]]
[1] 13 14

[[6]]
[1] 14 15 12

[[7]]
[1] 13 15 12

[[8]]
[1] 13 14

      

which was the desired outlet. Now the code above is not fast or pretty and I would really like to know if I can get a fantastic version data.table

. However, it seems to me that data.table

-fu is missing.

library(data.table)
DT <- data.table(indata)
# Create lists with the _indices_ of the offsprings
FT <- DT[ , list( yidx = list(.I) ) , by = fid ]
MT <- DT[ , list( yidx = list(.I) ) , by = mid ]

      

MT

as follows

   mid  yidx
1:  NA     2
2:   0     1
3:   2   3,4
4:   6 5,6,7
5:   8     8

      

Exactly as moffspring

above, except that it contains indices and not labels. This, however, is not a problem. Then I would like to concatenate the tables together

setkey(DT, fid)
setkey(FT, fid)
setkey(MT, mid)

# Inner join
P1 <- DT[FT]

# And inner join on mother
setkey(P1, mid)
P1[MT]

      

and now the end result looks like this:

   id fid mid  yidx i.yidx
1:  2   9  NA     2      2
2:  1  NA   0     1      1
3:  3   1   2   3,4    3,4
4:  4   1   2   3,4    3,4
5: 13   5   6 6,7,8  5,6,7
6: 14   5   6 6,7,8  5,6,7
7: 12   7   6     5  5,6,7
8: 15   5   8 6,7,8      8

      

It's almost. Now if I accept string concatenation yidx

and i.yidx

then I end up with a list of half sibs (including myself) and intersecting the strings will result in a full sibling. Note that the indices here are relative to the index in DT

, not the final data.table

, but can also be corrected.

However ... I have a feeling that something like this can be done much more efficiently in a few lines of code data.table

and a "gentle wave of the hand". Can anyone point me in the right direction?

[Sorry for the super long entry]


Update based on answers below. Just for fun, I ran three different sentences through microbenchmark

to see if there would be any time differences between the three approaches. f1()

is @Frank's suggestion, f2()

is the solution given by @mtoto, and f3

is @amatsuo_net's approach. Tried on vectors of length 1000, but the output

Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval cld
 f1() 4020.8112 4387.7950 4614.7896 4498.8043 4770.1184 6837.672   100   c
 f2()  656.9575  685.7706  727.5191  710.3003  735.2832 1080.423   100 a  
 f3() 1637.8927 1706.7528 1789.1794 1739.4428 1814.7776 2403.474   100  b 

      

Quite a significant difference in approaches. I need to run it across a dataset with 7 million IDs, which certainly has a noticeable impact. Thanks everyone!

+3


source to share


3 answers


Here's an approach using mapply()

in combination with setdiff()

and union()

. After collecting id

into the list, we first exclude the current identifier and then the union()

lists on both sides:



setDT(indata)[,msib:=.(list(id)), by = "mid"][
  ,msibs := mapply(setdiff, msib, id)][
  ,fsib  := .(list(id)), by = "fid"][
  ,fsibs := mapply(setdiff, fsib, id)][
  ,sibs  := mapply(union, msibs, fsibs)][
  ,c("msib","msibs", "fsib", "fsibs") := NULL]
> indata
#   id fid mid     sibs
#1:  1  NA   0         
#2:  2   9  NA         
#3:  3   1   2        4
#4:  4   1   2        3
#5: 12   7   6    13,14
#6: 13   5   6 12,14,15
#7: 14   5   6 12,13,15
#8: 15   5   8    13,14

      

+1


source


I would hold back on list columns as long as possible.

Starting with siblings, here's a simple approach:

sibDT = DT[!is.na(fid) & !is.na(mid), 
  CJ(id = id, sid = id)[id != sid]
, by=.(fid, mid)]

#    fid mid id sid
# 1:   1   2  3   4
# 2:   1   2  4   3
# 3:   5   6 13  14
# 4:   5   6 14  13

      

And then define half of the siblings as the split of the parent, but not showing up in sibDT

:

hsibDT = melt(DT, id = "id")[!is.na(value), 
  CJ(id = id, hsid = id)[id != hsid]
, by=.(ptype = variable, pid = value)][!sibDT, on=.(id, hsid = sid)]

#    ptype pid id hsid
# 1:   fid   5 13   15
# 2:   fid   5 14   15
# 3:   fid   5 15   13
# 4:   fid   5 15   14
# 5:   mid   6 12   13
# 6:   mid   6 12   14
# 7:   mid   6 13   12
# 8:   mid   6 14   12

      




I would stop here, but to view the results using a list column or symbol ...

DT[sibDT[, .(sibs = toString(sid)), by=id], on=.(id), sibs := i.sibs, by=.EACHI ]
DT[hsibDT[, .(hsibs = toString(hsid)), by=id], on=.(id), hsibs := i.hsibs, by=.EACHI ]

# or...

DT[
  rbind(sibDT[, .(id, oid = sid)], hsibDT[, .(id, oid = hsid)])[, 
    .(asibs = toString(oid))
  , by=.(id)], 
  on = .(id),
  asibs := i.asibs
, by = .EACHI]

      

which gives

   id fid mid sibs  hsibs      asibs
1:  1  NA   0   NA     NA         NA
2:  2   9  NA   NA     NA         NA
3:  3   1   2    4     NA          4
4:  4   1   2    3     NA          3
5: 12   7   6   NA 13, 14     13, 14
6: 13   5   6   14 15, 12 14, 15, 12
7: 14   5   6   13 15, 12 13, 15, 12
8: 15   5   8   NA 13, 14     13, 14

      

Adding these columns to DT

is counterproductive if your analysis is not complete. I assume that any useful analysis will be contained in unlisted columns contained in various tables.

+2


source


I would do with something like this.

library(data.table)
library(dplyr)
setDT(indata)
tmp <- merge(indata, indata[, 1:2], by = "fid", allow.cartesian = TRUE)
tmp2 <- merge(indata, indata, by = "mid", allow.cartesian = TRUE)
tmp3 <- rbindlist(list(tmp,tmp2), fill = T)
dt_siblings <- tmp3[id.x != id.y, unique(id.y) %>% sort() %>% paste(collapse = ", "), by = id.x][order(id.x)]
setnames(dt_siblings, 'id.x', 'id')
setnames(dt_siblings, 'V1', 'siblings')
outdata <- merge(indata, dt_siblings, all.x = T)

      

The idea is to combine indata

with indata

on fid

and mid

then rbindlist

them. Column id.y

in tmp3

is the siblings id (with duplicates due to full siblings). On the next line, remove the duplicate, sort and then merge. The result looks like this:

> outdata
   id fid mid   siblings
1:  1  NA   0         NA
2:  2   9  NA         NA
3:  3   1   2          4
4:  4   1   2          3
5: 12   7   6     13, 14
6: 13   5   6 12, 14, 15
7: 14   5   6 12, 13, 15
8: 15   5   8     13, 14

      

+1


source







All Articles