The sum of all pairwise products of a series as a two-sided matrix

I am looking at some high bandwidth data and doing a type of correlation analysis based on Bayesian statistics. One of the things I need to do is find each pairwise combination of products in the dataset and find the sum of each resulting row.

So, for example, for a high throughput dataset matrix Dataset

(Dataset <- structure(list(`Condition 1` = c(1L, 3L, 2L, 2L), `Condition 2` = c(2L, 1L, 7L, 2L), `Condition 3` = c(4L, 1L, 2L, 5L)), .Names = c("Condition 1", "Condition 2", "Condition 3"), class = "data.frame", row.names = c("Gene A", "Gene B", "Gene C", "Gene D")))
       Condition 1 Condition 2   Condition 3
Gene A           1           2             4
Gene B           3           1             1
Gene C           2           7             2
Gene D           2           2             5

      

First, I want to multiply all possible row pairs to get the following matrix named Comb

:

              Condition 1 Condition 2 Condition 3
Gene A Gene A           1           4           9
Gene A Gene B           3           2           4
Gene A Gene C           2          14           8
Gene A Gene D           2           4          20
Gene B Gene B           9           1           1
Gene B Gene C           6           7           2
Gene B Gene D           6           2           5
Gene C Gene C           4          49           4
Gene C Gene D           4          14          10
Gene D Gene D           4           4          25

      

After I want to find the row sums for each product and get the sums as a matrix (which I will call CombSums

):

            Gene A       Gene B      Gene C      Gene D 
Gene A          NA           10          24          26
Gene B          10           NA          15          13
Gene C          24           15          NA          28
Gene D          26           13          28          NA

      

When I tried to do it, the best I could think of was

combs <- combn(seq_len(nrow(Dataset)), 2)
Comb <- Dataset[combs[1,], ] * Dataset[combs[2,], ]
rownames(Comb) <- apply(combn(rownames(Comb), 2), 2, paste, collapse = " ")
CombSums <- rowSums(Comb)

      

Which will give me the sums in a list like below:

                    [1,]
Gene A Gene B       10
Gene A Gene C       24 
Gene A Gene D       26 
Gene B Gene C       15
Gene B Gene D       13
Gene C Gene D       28

      

Unfortunately, I want it to be a two-sided matrix, not a list, so this doesn't work, so if anyone can suggest a way to get the sums as a matrix it would be a big help.

+3


source to share


3 answers


If speed is an important factor (for example, if you are handling a huge matrix), you might find a useful Rcpp implementation. This only fills the top triangular part of the matrix.

library(Rcpp)
cppFunction(
 "NumericMatrix josilberRcpp(NumericMatrix x) {
   const int nr = x.nrow();
   const int nc = x.ncol();
   NumericMatrix y(nr, nr);
   for (int col=0; col < nc; ++col) {
    for (int i=0; i < nr; ++i) {
      for (int j=i; j < nr; ++j) {
        y(i, j) += x(i, col) * x(j, col);
      }
    }
   }
   return y;
}")
josilberRcpp(as.matrix(Dataset))
#      [,1] [,2] [,3] [,4]
# [1,]   21    9   24   26
# [2,]    0   11   15   13
# [3,]    0    0   57   28
# [4,]    0    0    0   33

      



My other answer provides benchmarking. Note that benchmarking does not include compile time using cppFunction

, which can be quite significant. Therefore, this implementation is probably only useful for very large inputs or when you need to reuse this function.

+2


source


You can do this by calculating the paired products for each column in the original data frame with lapply

and outer

, and then you can add all those paired products along with Reduce

and +

.

Reduce("+", lapply(dat, function(x) outer(x, x)))
#      [,1] [,2] [,3] [,4]
# [1,]   21    9   24   26
# [2,]    9   11   15   13
# [3,]   24   15   57   28
# [4,]   26   13   28   33

      

A variation on this topic that is less memory-intensive (because it doesn't need to store every column matrix at the same time), but more typing would be:



ret <- outer(dat[,1], dat[,1])
for (i in 2:ncol(dat)) ret <- ret + outer(dat[,i], dat[,i])
ret
#      [,1] [,2] [,3] [,4]
# [1,]   21    9   24   26
# [2,]    9   11   15   13
# [3,]   24   15   57   28
# [4,]   26   13   28   33

      

Here's a comparison of the approaches suggested so far in a 100 x 100 dataframe:

# Larger dataset
set.seed(144)
dat <- as.data.frame(matrix(rnorm(10000), nrow=100))

josilber <- function(dat) Reduce("+", lapply(dat, function(x) outer(x, x)))
josilber2 <- function(dat) {
  ret <- outer(dat[,1], dat[,1])
  for (i in 2:ncol(dat)) ret <- ret + outer(dat[,i], dat[,i])
  ret
}
frank <- function(DF) {
  mat   <- as.matrix(DF)
  pairs <- combn(1:nrow(DF),2)
  vals  <- rowSums(mat[pairs[1,],]*mat[pairs[2,],])
  res   <- matrix(,nrow(DF),nrow(DF),dimnames=list(rownames(DF),rownames(DF)))
  res[lower.tri(res)] <- vals
  res
}

library(microbenchmark)
microbenchmark(josilber(dat), josilber2(dat), josilberRcpp(as.matrix(dat)), frank(dat))
# Unit: microseconds
#                          expr       min        lq      mean    median        uq       max neval
#                 josilber(dat)  6867.499 45437.277 45506.731 46372.576 47549.834 85494.063   100
#                josilber2(dat)  6831.692  7982.539 10245.459  9109.023 10883.965 50612.600   100
#  josilberRcpp(as.matrix(dat))   989.592  1112.316  1290.617  1204.388  1483.638  2384.348   100
#                    frank(dat) 13043.912 53369.804 52488.997 53921.402 54855.583 62566.730   100

      

+4


source


By using combn

, you can avoid redundant computation:

mat   <- as.matrix(DF)

pairs <- combn(1:nrow(DF),2)

vals  <- rowSums(mat[pairs[1,],]*mat[pairs[2,],])
res   <- matrix(,nrow(DF),nrow(DF),dimnames=list(rownames(DF),rownames(DF)))
res[lower.tri(res)] <- vals

#       GeneA GeneB GeneC GeneD
# GeneA    NA    NA    NA    NA
# GeneB     9    NA    NA    NA
# GeneC    24    15    NA    NA
# GeneD    26    13    28    NA

      

Your matrix Comb

is an intermediate result mat[pairs[1,],]*mat[pairs[2,],]

.


The whole calculation can be done inside combn

, one by one:

vals <- combn(rownames(DF),2,FUN=function(x)sum(apply(DF[x,],2,prod)))

      

As @josilber pointed out in the comment below, this is incredibly slow.


Data:

DF <- read.table(header=TRUE,text="Condition1 Condition2   Condition3
GeneA           1           2             4
GeneB           3           1             1
GeneC           2           7             2
GeneD           2           2             5")

      

+4


source







All Articles