R Matrix package: Demean sparse matrix

Is there an easy way to demean a sparse matrix by columns while treating zero values as missing (using the Matrix package)?

There seem to be two problems that I am struggling with:

Computing the correct column means

Empty cells are treated as zeros, not as missing values:

M0 <- matrix(rep(1:5,4),nrow = 4)
M0[2,2] <- M0[2,3] <- 0
M <- as(M0, "sparseMatrix")
M
#[1,] 1 5 4 3 2
#[2,] 2 . . 4 3
#[3,] 3 2 1 5 4
#[4,] 4 3 2 1 5
colMeans(M)
#[1] 2.50 2.50 1.75 3.25 3.50

      

The correct result should be:

colMeans_correct <- colSums(M) / c(4,3,3,4,4)
colMeans_correct
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000

      

Subtracting the column means

The subtraction is also applied to the empty (missing) cells:

sweep(M, 2, colMeans_correct)
#4 x 5 Matrix of class "dgeMatrix"
#     [,1]       [,2]       [,3]  [,4] [,5]
#[1,] -1.5  1.6666667  1.6666667 -0.25 -1.5
#[2,] -0.5 -3.3333333 -2.3333333  0.75 -0.5
#[3,]  0.5 -1.3333333 -1.3333333  1.75  0.5
#[4,]  1.5 -0.3333333 -0.3333333 -2.25  1.5

      

PS: I hope it is not a problem to ask a question that covers two issues. They belong to the same task and seem to reflect the same underlying problem: distinguishing between missing values and actual zeros.



1 answer


One option is to divide the colSums of the matrix by the colSums of the logical matrix that marks its nonzero entries:

colSums(M)/colSums(M!=0)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000

      

Another option is to replace 0 with NA and compute colMeans with the argument na.rm = TRUE:

colMeans(M*NA^!M, na.rm = TRUE)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000

      




Or, as @user20650 commented:

colSums(M) / diff(M@p)
#[1] 2.500000 3.333333 2.333333 3.250000 3.500000

      

where 'p' is the pointer vector described in ?sparseMatrix:

In typical usage, p is missing, i and j are vectors of positive integers, and x is a numeric vector. These three vectors, which must be of the same length, form a triplet representation of the sparse matrix.

If i or j is missing, then p must be a non-decreasing integer vector whose first element is zero. It provides the compressed, or "pointer", representation of the row or column indices, whichever is missing. The expanded form of p, rep(seq_along(dp), dp) where dp <- diff(p), is used as the (1-based) row or column indices.
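The same pointer can also be used for the second part of the question, subtracting the column means from the stored entries only. The following is a minimal sketch, not a definitive implementation: it assumes M is a column-compressed dgCMatrix (built here with Matrix(), which is an assumption about how the matrix is created) and that no explicit zeros are stored in M@x. The names n_stored, col_idx, col_means and M_demeaned are made up for illustration.

```r
library(Matrix)

# Rebuild the example matrix from the question as a dgCMatrix.
M0 <- matrix(as.numeric(rep(1:5, 4)), nrow = 4)
M0[2, 2] <- M0[2, 3] <- 0
M <- Matrix(M0, sparse = TRUE)

# Number of stored (non-missing) entries in each column.
n_stored <- diff(M@p)

# Expanded form of p: the 1-based column index of every stored entry in M@x.
col_idx <- rep(seq_along(n_stored), n_stored)

# Column means over the stored entries only.
col_means <- colSums(M) / n_stored

# Subtract each column's mean from its stored entries; empty cells stay empty.
M_demeaned <- M
M_demeaned@x <- M@x - col_means[col_idx]
M_demeaned
```

Because only the x slot is modified, the result stays sparse and the cells that were empty in M remain empty. If the matrix might contain explicitly stored zeros, calling drop0(M) first would remove them so that they are not counted as observed values.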
