R (data.table): Fast counting of values that match across multiple columns

Is there a quick way to count the number of times that a value that appears in one of several vectors also appears in several other vectors? Here's an example:

library(data.table)
names<-c(rep('apple',4),rep('banana',3),rep('cantalope',2),'date')
set.seed(38291)
v1<-data.table(municipality=rep('A',6),village=rep('1',6),
               last=sample(names,6,replace=TRUE),
               middle=sample(names,6,replace=TRUE),id=c(1:6))
v2<-data.table(municipality=rep('A',4),village=rep('2',4),
               last=sample(names,4,replace=TRUE),
               middle=sample(names,4,replace=TRUE),id=c(7:10))
v1
#    municipality village      last    middle id
# 1:            A       1    banana cantalope  1
# 2:            A       1 cantalope    banana  2
# 3:            A       1 cantalope cantalope  3
# 4:            A       1     apple     apple  4
# 5:            A       1    banana     apple  5
# 6:            A       1     apple     apple  6
v2
#    municipality village      last    middle id
# 1:            A       2      date cantalope  7
# 2:            A       2     apple      date  8
# 3:            A       2 cantalope    banana  9
# 4:            A       2     apple cantalope 10
DT = rbind(v1, v2)

      

I want to calculate the number of family ties that cross between individuals in village 1 and village 2. A family tie between villages exists when a person's last or middle name ("last" or "middle") matches the last or middle name of a person in another village. In this example, the person with id = 1, who lives in village 1, has three family members in village 2 (ids 7, 9 and 10) because he shares at least one name with each of them. Next, I want to create a dyadic village dataset in which the link between two villages is the number of family ties that cross between them. So, in this example, the resulting dataset will look like this:

dyads<-data.table(v1='1',v2='2',ties=3+3+3+2+3+2)
dyads
#    v1 v2 ties
# 1:  1  2   16
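
To make the counting rule concrete, here is a brute-force check on the example. It is a minimal sketch for verification, not a scalable solution. It assumes the ten rows are typed in from the printed tables above (so it does not depend on the random seed), and the names `a`, `b`, `has_tie`, `ties_per_person` and `total_ties` are introduced here for illustration only.

```r
library(data.table)

# Rows typed in from the printed v1 and v2 tables
# (assumption: the printout is the data of interest)
DT <- data.table(
  village = rep(c("1", "2"), times = c(6, 4)),
  id      = 1:10,
  last    = c("banana", "cantalope", "cantalope", "apple", "banana", "apple",
              "date", "apple", "cantalope", "apple"),
  middle  = c("cantalope", "banana", "cantalope", "apple", "apple", "apple",
              "cantalope", "date", "banana", "cantalope")
)
a <- DT[village == "1"]
b <- DT[village == "2"]

# has_tie[i, j] is TRUE when person i of village 1 shares a last or
# middle name with person j of village 2
has_tie <- outer(seq_len(nrow(a)), seq_len(nrow(b)), function(i, j) {
  a$last[i] == b$last[j] | a$last[i] == b$middle[j] |
    a$middle[i] == b$last[j] | a$middle[i] == b$middle[j]
})

ties_per_person <- rowSums(has_tie)  # ties of ids 1..6 into village 2
total_ties <- sum(has_tie)           # the single dyad count for villages 1 and 2
```

On this data the per-person counts should reproduce the 3+3+3+2+3+2 = 16 from the question. The approach builds a full person-by-person matrix for each village pair, so it is only useful for checking a faster method on small inputs.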

      

Is there an efficient way to calculate this number of ties? I wrote an inefficient loop for this, but I have a massive dataset (~50 million people in 40,000 villages).



2 answers


Frank-inspired update:



meltDT = 
  #use unique to eliminate last+middle duplication
  unique(melt(DT, measure.vars = c('last', 'middle'), 
              id.vars = c('village', 'id'), value.name = 'name'),
         by = c('village', 'id', 'name'))

#framework of output -- one row for each pair of villages
#name the columns explicitly so the filter does not rely on CJ's default names
out.dt = with(DT, CJ(V1 = village, V2 = village, unique = TRUE))[V2 > V1]

#set indices to facilitate merges on names
setkey(meltDT, village)
setindex(meltDT, name)
out.dt[ , {
  ties := 
    #unique here eliminates matching on both last & middle
    uniqueN(meltDT[.(.BY$V1)][meltDT[.(.BY$V2)], on = 'name', 
                              allow.cartesian = TRUE, nomatch = 0L],
            by = c('id', 'i.id'))
}, by = .(V1, V2)]
out.dt
#    V1 V2 ties
# 1:  1  2   16

      



This extends to 3+ villages, but will be quite slow:

DT = rbind(v1, v2)

matches = melt(DT, id.vars="id", measure.vars=c("middle","last"))[, 
  CJ(id1 = id, id2 = id)[id1 < id2]
, by=value]

matches[DT, on=.(id1 = id), v1 := i.village ]
matches[DT, on=.(id2 = id), v2 := i.village ]

unique(matches[, !"value"])[v2 != v1, .N, by=.(v1, v2)]
#    v1 v2  N
# 1:  1  2 16

      



So it finds every pair of people who share a name (even if they are in the same village), and the OP's desired outcome is just a summary count calculated from that set of matches.
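
One caveat when summarizing by village after pairing on `id1 < id2`: if ids are not ordered by village, the same pair of villages can show up in both orientations (1, 2) and (2, 1). The following is a minimal sketch of a variant that orients each pair by village instead, by self-joining the melted names. The three-village toy data and the names `long`, `pairs` and `dyads` are hypothetical and not part of the question.

```r
library(data.table)

# Hypothetical toy data: three villages, ids deliberately not ordered by village
DT <- data.table(
  village = c("1", "1", "2", "2", "3"),
  id      = c(1L, 4L, 2L, 5L, 3L),
  last    = c("apple", "banana", "apple", "date", "banana"),
  middle  = c("banana", "banana", "cantalope", "apple", "apple")
)

# One row per distinct (village, id, name), so last == middle counts once
long <- unique(
  melt(DT, id.vars = c("village", "id"), measure.vars = c("last", "middle"),
       value.name = "name")[, .(village, id, name)]
)

# Self-join on name; keep only cross-village pairs, oriented so v1 < v2
pairs <- long[long, on = "name", allow.cartesian = TRUE][village < i.village]

# Two people sharing both names would appear twice, so deduplicate the pairs
pairs <- unique(pairs[, .(v1 = village, id1 = id, v2 = i.village, id2 = i.id)])

# One row per village dyad with its number of ties
dyads <- pairs[, .(ties = .N), by = .(v1, v2)][order(v1, v2)]
```

Filtering on `village < i.village` also drops within-village matches early, which shrinks the intermediate table, though the join can still be large when a name is very common.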
