R co-occurrence matrix from wide-format data
R newbie here. I am trying to build a co-occurrence matrix from wide-format data: I want to know which elements are "TRUE" together in the same row.
Each row represents an article. Each article has many TRUE/FALSE variables indicating the presence or absence of an element. There are 100 elements (abbreviated to four here) and over 10k articles, so the data frame is roughly 10,000 x 101.
dat <- read.table(text='"article" "element1" "element2" "element3" "element4"
1 "a house a home" "TRUE" "TRUE" "FALSE" "FALSE"
2 "cabin in the woods" "TRUE" "TRUE" "FALSE" "FALSE"
3 "motel is a hotel" "TRUE" "FALSE" "TRUE" "FALSE"', header=TRUE)
I tried the approach from this co-occurrence question (Creating a Match Matrix), but because my data is organized differently, that approach doesn't work.
What I would like is a 100 x 100 matrix of elements by elements. Does anyone have any suggestions?
The sparse Matrix answer in your linked question provides a quick and easy way to do this. It is actually (somewhat) easier to do with your data structure.
# Make a vector of all elements.
elems <- colnames(dat)[-1]
# Make a sparse matrix
library(Matrix)
s <- Matrix(as.matrix(dat[elems]), sparse=TRUE, dimnames=list(dat$article,elems))
# calculate co-occurrences
(t(s) %*% s)
# 4 x 4 sparse Matrix of class "dgCMatrix"
#          element1 element2 element3 element4
# element1        3        2        1        .
# element2        2        2        .        .
# element3        1        .        1        .
# element4        .        .        .        .
# If you don't want the exact number, and you want a "dense" matrix
as.matrix((t(s) %*% s) >= 1)
#          element1 element2 element3 element4
# element1     TRUE     TRUE     TRUE    FALSE
# element2     TRUE     TRUE    FALSE    FALSE
# element3     TRUE    FALSE     TRUE    FALSE
# element4    FALSE    FALSE    FALSE    FALSE
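At this size (about 10,000 x 100), a dense base-R version may also be workable. The following is a minimal sketch, assuming the `dat` from the question and that the element columns are read in as logical; the names `m` and `co` are introduced here for illustration. It relies on the fact that entry (i, j) of t(m) %*% m sums, over all articles, the product of columns i and j, and that product is 1 only when both elements are TRUE.

```r
# Recreate the example data from the question
dat <- read.table(text='"article" "element1" "element2" "element3" "element4"
1 "a house a home" "TRUE" "TRUE" "FALSE" "FALSE"
2 "cabin in the woods" "TRUE" "TRUE" "FALSE" "FALSE"
3 "motel is a hotel" "TRUE" "FALSE" "TRUE" "FALSE"', header=TRUE)

# Convert the logical element columns to a 0/1 integer matrix
m <- as.matrix(dat[-1]) * 1L

# crossprod(m) computes t(m) %*% m
co <- crossprod(m)

# Diagonal: number of articles containing each element
# Off-diagonal: number of articles containing both elements
co
```

The sparse version is likely the better choice when most entries are FALSE; the dense version simply avoids the extra package.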
This looks pretty fast:
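# Pairwise counts; the diagonal (an element with itself) is left at 0
n <- ncol(dat) - 1
mat <- matrix(0, ncol = n, nrow = n)
res <- combn(colnames(dat)[-1], 2,
             FUN = function(x) sum(dat[[x[1]]] & dat[[x[2]]]))
# combn() lists pairs in the same order that lower.tri() fills cells
mat[lower.tri(mat)] <- res
# Mirror by transposing; upper.tri() fills cells in a different order
mat <- mat + t(mat)
mat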
#      [,1] [,2] [,3] [,4]
# [1,]    0    2    1    0
# [2,]    2    0    0    0
# [3,]    1    0    0    0
# [4,]    0    0    0    0
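The fill order matters when spreading `combn()` results into a square matrix. Here is a small sketch with made-up, distinct pair values (the names `pairs`, `vals`, and `m` are illustrative and not part of the answer). It suggests that `lower.tri()` visits cells in the same order in which `combn()` lists pairs, so mirroring by transposition is a safe way to fill the upper triangle:

```r
# All pairs of 4 items, in combn() order: (1,2) (1,3) (1,4) (2,3) (2,4) (3,4)
pairs <- combn(4, 2)

# Hypothetical distinct values, encoding each pair (i, j) as 10*i + j
vals <- 10 * pairs[1, ] + pairs[2, ]

m <- matrix(0, 4, 4)
# lower.tri() fills column by column: (2,1) (3,1) (4,1) (3,2) (4,2) (4,3)
m[lower.tri(m)] <- vals
# Mirror into the upper triangle by adding the transpose (diagonal stays 0)
m <- m + t(m)

m[2, 3]  # value for pair (2,3)
m[3, 2]  # same pair, mirrored
```

Assigning `vals` directly through `upper.tri(m)` would instead put the (1,4) value into cell (2,3), because `upper.tri()` also fills column by column. The small example in the question happens not to expose this.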