The sum of all pairwise products of a series as a two-sided matrix
I am looking at some high bandwidth data and doing a type of correlation analysis based on Bayesian statistics. One of the things I need to do is find each pairwise combination of products in the dataset and find the sum of each resulting row.
So, for example, for a high throughput dataset matrix Dataset
(Dataset <- structure(list(`Condition 1` = c(1L, 3L, 2L, 2L), `Condition 2` = c(2L, 1L, 7L, 2L), `Condition 3` = c(4L, 1L, 2L, 5L)), .Names = c("Condition 1", "Condition 2", "Condition 3"), class = "data.frame", row.names = c("Gene A", "Gene B", "Gene C", "Gene D")))
Condition 1 Condition 2 Condition 3
Gene A 1 2 4
Gene B 3 1 1
Gene C 2 7 2
Gene D 2 2 5
First, I want to multiply all possible row pairs to get the following matrix named Comb
:
Condition 1 Condition 2 Condition 3
Gene A Gene A 1 4 9
Gene A Gene B 3 2 4
Gene A Gene C 2 14 8
Gene A Gene D 2 4 20
Gene B Gene B 9 1 1
Gene B Gene C 6 7 2
Gene B Gene D 6 2 5
Gene C Gene C 4 49 4
Gene C Gene D 4 14 10
Gene D Gene D 4 4 25
After I want to find the row sums for each product and get the sums as a matrix (which I will call CombSums
):
Gene A Gene B Gene C Gene D
Gene A NA 10 24 26
Gene B 10 NA 15 13
Gene C 24 15 NA 28
Gene D 26 13 28 NA
When I tried to do it, the best I could think of was
combs <- combn(seq_len(nrow(Dataset)), 2)
Comb <- Dataset[combs[1,], ] * Dataset[combs[2,], ]
rownames(Comb) <- apply(combn(rownames(Comb), 2), 2, paste, collapse = " ")
CombSums <- rowSums(Comb)
Which will give me the sums in a list like below:
[1,]
Gene A Gene B 10
Gene A Gene C 24
Gene A Gene D 26
Gene B Gene C 15
Gene B Gene D 13
Gene C Gene D 28
Unfortunately, I want it to be a two-sided matrix, not a list, so this doesn't work, so if anyone can suggest a way to get the sums as a matrix it would be a big help.
source to share
If speed is an important factor (for example, if you are handling a huge matrix), you might find a useful Rcpp implementation. This only fills the top triangular part of the matrix.
library(Rcpp)
cppFunction(
"NumericMatrix josilberRcpp(NumericMatrix x) {
const int nr = x.nrow();
const int nc = x.ncol();
NumericMatrix y(nr, nr);
for (int col=0; col < nc; ++col) {
for (int i=0; i < nr; ++i) {
for (int j=i; j < nr; ++j) {
y(i, j) += x(i, col) * x(j, col);
}
}
}
return y;
}")
josilberRcpp(as.matrix(Dataset))
# [,1] [,2] [,3] [,4]
# [1,] 21 9 24 26
# [2,] 0 11 15 13
# [3,] 0 0 57 28
# [4,] 0 0 0 33
My other answer provides benchmarking. Note that benchmarking does not include compile time using cppFunction
, which can be quite significant. Therefore, this implementation is probably only useful for very large inputs or when you need to reuse this function.
source to share
You can do this by calculating the paired products for each column in the original data frame with lapply
and outer
, and then you can add all those paired products along with Reduce
and +
.
Reduce("+", lapply(dat, function(x) outer(x, x)))
# [,1] [,2] [,3] [,4]
# [1,] 21 9 24 26
# [2,] 9 11 15 13
# [3,] 24 15 57 28
# [4,] 26 13 28 33
A variation on this topic that is less memory-intensive (because it doesn't need to store every column matrix at the same time), but more typing would be:
ret <- outer(dat[,1], dat[,1])
for (i in 2:ncol(dat)) ret <- ret + outer(dat[,i], dat[,i])
ret
# [,1] [,2] [,3] [,4]
# [1,] 21 9 24 26
# [2,] 9 11 15 13
# [3,] 24 15 57 28
# [4,] 26 13 28 33
Here's a comparison of the approaches suggested so far in a 100 x 100 dataframe:
# Larger dataset
set.seed(144)
dat <- as.data.frame(matrix(rnorm(10000), nrow=100))
josilber <- function(dat) Reduce("+", lapply(dat, function(x) outer(x, x)))
josilber2 <- function(dat) {
ret <- outer(dat[,1], dat[,1])
for (i in 2:ncol(dat)) ret <- ret + outer(dat[,i], dat[,i])
ret
}
frank <- function(DF) {
mat <- as.matrix(DF)
pairs <- combn(1:nrow(DF),2)
vals <- rowSums(mat[pairs[1,],]*mat[pairs[2,],])
res <- matrix(,nrow(DF),nrow(DF),dimnames=list(rownames(DF),rownames(DF)))
res[lower.tri(res)] <- vals
res
}
library(microbenchmark)
microbenchmark(josilber(dat), josilber2(dat), josilberRcpp(as.matrix(dat)), frank(dat))
# Unit: microseconds
# expr min lq mean median uq max neval
# josilber(dat) 6867.499 45437.277 45506.731 46372.576 47549.834 85494.063 100
# josilber2(dat) 6831.692 7982.539 10245.459 9109.023 10883.965 50612.600 100
# josilberRcpp(as.matrix(dat)) 989.592 1112.316 1290.617 1204.388 1483.638 2384.348 100
# frank(dat) 13043.912 53369.804 52488.997 53921.402 54855.583 62566.730 100
source to share
By using combn
, you can avoid redundant computation:
mat <- as.matrix(DF)
pairs <- combn(1:nrow(DF),2)
vals <- rowSums(mat[pairs[1,],]*mat[pairs[2,],])
res <- matrix(,nrow(DF),nrow(DF),dimnames=list(rownames(DF),rownames(DF)))
res[lower.tri(res)] <- vals
# GeneA GeneB GeneC GeneD
# GeneA NA NA NA NA
# GeneB 9 NA NA NA
# GeneC 24 15 NA NA
# GeneD 26 13 28 NA
Your matrix Comb
is an intermediate result mat[pairs[1,],]*mat[pairs[2,],]
.
The whole calculation can be done inside combn
, one by one:
vals <- combn(rownames(DF),2,FUN=function(x)sum(apply(DF[x,],2,prod)))
As @josilber pointed out in the comment below, this is incredibly slow.
Data:
DF <- read.table(header=TRUE,text="Condition1 Condition2 Condition3
GeneA 1 2 4
GeneB 3 1 1
GeneC 2 7 2
GeneD 2 2 5")
source to share