Sibling network using data.table
I would like to create a sibship network using data.table
.
My data looks like
indata <-
structure(list(id = c(1L, 2L, 3L, 4L, 12L, 13L, 14L, 15L), fid = c(NA,
9L, 1L, 1L, 7L, 5L, 5L, 5L), mid = c(0L, NA, 2L, 2L, 6L, 6L,
6L, 8L)), .Names = c("id", "fid", "mid"), class = "data.frame", row.names =
c(NA, -8L))
which the
id fid mid
1 1 NA 0
2 2 9 NA
3 3 1 2
4 4 1 2
5 12 7 6
6 13 5 6
7 14 5 6
8 15 5 8
The three columns represent the id, id of the father, and id of the mother, respectively. 0
or NA
means it is not available. So in the above data, person 3 and 4 are full brothers and sisters (they both have a father 1
and mother 2
), and 12 and 13 are half a brother (they have different fathers, but the same mother, 6
).
For each line in the dataframe, I need a list of siblings (let's just consider half of the siblings). My ideal end result would be
id fid mid sibs
1 1 NA 0 NA
2 2 9 NA NA
3 3 1 2 4
4 4 1 2 3
5 12 7 6 13, 14
6 13 5 6 12, 14, 15
7 14 5 6 12, 13, 15
8 15 5 8 13, 14
where the last column sibs
is a list or a vector (and it doesn't have to be part of the dataset).
A rough version for getting the result using R base is given below
# get a list of offspring for each father id
foffspring <- by(indata, indata$fid, function(i) { i$id }, simplify=FALSE)
# and mother id
moffspring <- by(indata, indata$mid, function(i) { i$id }, simplify=FALSE)
To get siblings to go through each id. Find their father and mother and combine the two matching entries from the previous lists.
sibs <- sapply( 1:nrow(indata), function(i) {
res <- c()
if( !is.na(indata$fid[i]) )
res <- c(res, unlist(foffspring[paste0(indata$fid[i])]))
if( !is.na(indata$mid[i]) )
res <- c(res, unlist(moffspring[paste0(indata$mid[i])]))
unique(res[res != indata$id[i]])
}, simplify=TRUE )
This creates
> sibs
[[1]]
integer(0)
[[2]]
integer(0)
[[3]]
[1] 4
[[4]]
[1] 3
[[5]]
[1] 13 14
[[6]]
[1] 14 15 12
[[7]]
[1] 13 15 12
[[8]]
[1] 13 14
which was the desired outlet. Now the code above is not fast or pretty and I would really like to know if I can get a fantastic version data.table
. However, it seems to me that data.table
-fu is missing.
library(data.table)
DT <- data.table(indata)
# Create lists with the _indices_ of the offsprings
FT <- DT[ , list( yidx = list(.I) ) , by = fid ]
MT <- DT[ , list( yidx = list(.I) ) , by = mid ]
MT
as follows
mid yidx
1: NA 2
2: 0 1
3: 2 3,4
4: 6 5,6,7
5: 8 8
Exactly as moffspring
above, except that it contains indices and not labels. This, however, is not a problem. Then I would like to concatenate the tables together
setkey(DT, fid)
setkey(FT, fid)
setkey(MT, mid)
# Inner join
P1 <- DT[FT]
# And inner join on mother
setkey(P1, mid)
P1[MT]
and now the end result looks like this:
id fid mid yidx i.yidx
1: 2 9 NA 2 2
2: 1 NA 0 1 1
3: 3 1 2 3,4 3,4
4: 4 1 2 3,4 3,4
5: 13 5 6 6,7,8 5,6,7
6: 14 5 6 6,7,8 5,6,7
7: 12 7 6 5 5,6,7
8: 15 5 8 6,7,8 8
It's almost. Now if I accept string concatenation yidx
and i.yidx
then I end up with a list of half sibs (including myself) and intersecting the strings will result in a full sibling. Note that the indices here are relative to the index in DT
, not the final data.table
, but can also be corrected.
However ... I have a feeling that something like this can be done much more efficiently in a few lines of code data.table
and a "gentle wave of the hand". Can anyone point me in the right direction?
[Sorry for the super long entry]
Update based on answers below. Just for fun, I ran three different sentences through microbenchmark
to see if there would be any time differences between the three approaches. f1()
is @Frank's suggestion, f2()
is the solution given by @mtoto, and f3
is @amatsuo_net's approach. Tried on vectors of length 1000, but the output
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 4020.8112 4387.7950 4614.7896 4498.8043 4770.1184 6837.672 100 c
f2() 656.9575 685.7706 727.5191 710.3003 735.2832 1080.423 100 a
f3() 1637.8927 1706.7528 1789.1794 1739.4428 1814.7776 2403.474 100 b
Quite a significant difference in approaches. I need to run it across a dataset with 7 million IDs, which certainly has a noticeable impact. Thanks everyone!
source to share
Here's an approach using mapply()
in combination with setdiff()
and union()
. After collecting id
into the list, we first exclude the current identifier and then the union()
lists on both sides:
setDT(indata)[,msib:=.(list(id)), by = "mid"][
,msibs := mapply(setdiff, msib, id)][
,fsib := .(list(id)), by = "fid"][
,fsibs := mapply(setdiff, fsib, id)][
,sibs := mapply(union, msibs, fsibs)][
,c("msib","msibs", "fsib", "fsibs") := NULL]
> indata
# id fid mid sibs
#1: 1 NA 0
#2: 2 9 NA
#3: 3 1 2 4
#4: 4 1 2 3
#5: 12 7 6 13,14
#6: 13 5 6 12,14,15
#7: 14 5 6 12,13,15
#8: 15 5 8 13,14
source to share
I would hold back on list columns as long as possible.
Starting with siblings, here's a simple approach:
sibDT = DT[!is.na(fid) & !is.na(mid),
CJ(id = id, sid = id)[id != sid]
, by=.(fid, mid)]
# fid mid id sid
# 1: 1 2 3 4
# 2: 1 2 4 3
# 3: 5 6 13 14
# 4: 5 6 14 13
And then define half of the siblings as the split of the parent, but not showing up in sibDT
:
hsibDT = melt(DT, id = "id")[!is.na(value),
CJ(id = id, hsid = id)[id != hsid]
, by=.(ptype = variable, pid = value)][!sibDT, on=.(id, hsid = sid)]
# ptype pid id hsid
# 1: fid 5 13 15
# 2: fid 5 14 15
# 3: fid 5 15 13
# 4: fid 5 15 14
# 5: mid 6 12 13
# 6: mid 6 12 14
# 7: mid 6 13 12
# 8: mid 6 14 12
I would stop here, but to view the results using a list column or symbol ...
DT[sibDT[, .(sibs = toString(sid)), by=id], on=.(id), sibs := i.sibs, by=.EACHI ]
DT[hsibDT[, .(hsibs = toString(hsid)), by=id], on=.(id), hsibs := i.hsibs, by=.EACHI ]
# or...
DT[
rbind(sibDT[, .(id, oid = sid)], hsibDT[, .(id, oid = hsid)])[,
.(asibs = toString(oid))
, by=.(id)],
on = .(id),
asibs := i.asibs
, by = .EACHI]
which gives
id fid mid sibs hsibs asibs
1: 1 NA 0 NA NA NA
2: 2 9 NA NA NA NA
3: 3 1 2 4 NA 4
4: 4 1 2 3 NA 3
5: 12 7 6 NA 13, 14 13, 14
6: 13 5 6 14 15, 12 14, 15, 12
7: 14 5 6 13 15, 12 13, 15, 12
8: 15 5 8 NA 13, 14 13, 14
Adding these columns to DT
is counterproductive if your analysis is not complete. I assume that any useful analysis will be contained in unlisted columns contained in various tables.
source to share
I would do with something like this.
library(data.table)
library(dplyr)
setDT(indata)
tmp <- merge(indata, indata[, 1:2], by = "fid", allow.cartesian = TRUE)
tmp2 <- merge(indata, indata, by = "mid", allow.cartesian = TRUE)
tmp3 <- rbindlist(list(tmp,tmp2), fill = T)
dt_siblings <- tmp3[id.x != id.y, unique(id.y) %>% sort() %>% paste(collapse = ", "), by = id.x][order(id.x)]
setnames(dt_siblings, 'id.x', 'id')
setnames(dt_siblings, 'V1', 'siblings')
outdata <- merge(indata, dt_siblings, all.x = T)
The idea is to combine indata
with indata
on fid
and mid
then rbindlist
them. Column id.y
in tmp3
is the siblings id (with duplicates due to full siblings). On the next line, remove the duplicate, sort and then merge. The result looks like this:
> outdata
id fid mid siblings
1: 1 NA 0 NA
2: 2 9 NA NA
3: 3 1 2 4
4: 4 1 2 3
5: 12 7 6 13, 14
6: 13 5 6 12, 14, 15
7: 14 5 6 12, 13, 15
8: 15 5 8 13, 14
source to share