Cross-column comparison of the same frame.
I have a data.frame that looks like this:
> DF1
A B C D E
a x c h p
c d q t w
s e r p a
w l t s i
p i y a f
I would like to compare each column of my data.frame with the rest of the columns in order to count the number of common items. For example, I would like to compare column A with all other columns (B, C, D, E) and count the common objects like this:
A versus the rest:
- A vs B: 0 (no elements in common)
- A vs C: 1 (c)
- A vs D: 3 (a, s, and p)
- A vs E: 3 (p, w, and a)
Then the same for B versus columns C, D, and E, and so on.
Can anyone help me? I don't know how to implement this.
We can loop over the column names and compare each column with the others by taking the intersect and getting the lengths:
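# For each column x, intersect it with every other column and count the shared values
sapply(names(DF1), function(x) {
  x1 <- lengths(Map(intersect, DF1[setdiff(names(DF1), x)], DF1[x]))
  # add a 0 for the column itself, then restore the original column order
  c(x1, setNames(0, setdiff(names(DF1), names(x1))))[names(DF1)]
})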
# A B C D E
#A 0 0 1 3 3
#B 0 0 0 0 1
#C 1 0 0 1 0
#D 3 0 1 0 2
#E 3 1 0 2 0
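The same intersect idea can also be written as a nested sapply over the columns. This is only a minimal sketch: the data frame is rebuilt by hand from the question, the columns are assumed to be character vectors rather than factors, and the name `common` is just illustrative.

```r
# Example data from the question, rebuilt by hand (character columns assumed)
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

# Count the distinct values shared by every pair of columns
common <- sapply(DF1, function(a) {
  sapply(DF1, function(b) length(intersect(a, b)))
})

# A column compared with itself is not of interest, so zero the diagonal
diag(common) <- 0
common
```

This does twice the necessary work (each pair is compared in both directions), which should not matter for a handful of columns.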
Or it can be done more compactly by taking the cross product of the frequency table built from the long-format (melt) dataset:
library(reshape2)
# tabulate column name against value, take the cross product, then zero the diagonal
tcrossprod(table(melt(as.matrix(DF1))[-1])) * !diag(5)
# Var2
#Var2 A B C D E
# A 0 0 1 3 3
# B 0 0 0 0 1
# C 1 0 0 1 0
# D 3 0 1 0 2
# E 3 1 0 2 0
NOTE: Apart from this, crossprod has also been implemented with RcppEigen, which would make it faster.
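If reshape2 is not available, the cross-product approach can be sketched in base R with stack(). The table is reduced to presence/absence before multiplying, so a value repeated within one column is still counted once, as it would be with intersect; with raw counts, repeated values would be counted as products. The names `long`, `inc`, and `shared` are illustrative, and character columns are assumed.

```r
# Example data from the question, rebuilt by hand (character columns assumed)
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

# Long format: one row per cell, with columns `values` and `ind` (the column name)
long <- stack(DF1)

# Incidence matrix: 1 if the column (row of inc) contains the value (column of inc)
inc <- 1L * (table(long$ind, long$values) > 0)

# Entry [i, j] is the number of distinct values present in both column i and column j
shared <- tcrossprod(inc)
diag(shared) <- 0
shared
```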
An alternative is to use combn twice: once to get the combinations of columns, and once to find the lengths of the intersections of their elements. cbind.data.frame returns a data.frame, and setNames is used to add the column names.
setNames(cbind.data.frame(t(combn(names(DF1), 2)),
                          combn(names(DF1), 2, function(x) length(intersect(DF1[, x[1]], DF1[, x[2]])))),
         c("col1", "col2", "count"))
col1 col2 count
1 A B 0
2 A C 1
3 A D 3
4 A E 3
5 B C 0
6 B D 0
7 B E 1
8 C D 1
9 C E 0
10 D E 2
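If a matrix layout is preferred, so that one row gives "A versus the rest" directly, the pairwise counts can be spread into a symmetric matrix. This is a sketch only: the names `pairs`, `counts`, and `m` are illustrative, and character columns are assumed.

```r
# Example data from the question, rebuilt by hand (character columns assumed)
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

# All unordered pairs of column names, one pair per row
pairs <- t(combn(names(DF1), 2))

# Number of shared values for each pair, in the same order as `pairs`
counts <- combn(names(DF1), 2, function(x) {
  length(intersect(DF1[[x[1]]], DF1[[x[2]]]))
})

# Empty square matrix labelled by column name
m <- matrix(0, ncol(DF1), ncol(DF1),
            dimnames = list(names(DF1), names(DF1)))

# Fill the upper triangle using character matrix indexing, then mirror it
m[pairs] <- counts
m <- m + t(m)

# Counts for column A against every other column
m["A", ]
```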