R - Link sparse matrices of different sizes by rows

Question

I am trying to use the Matrix package to bind two sparse matrices of different sizes by row. The rows should be stacked, with columns matched by column name.

Table A:

ID     | AAAA   | BBBB   |
------ | ------ | ------ |
XXXX   | 1      | 2      |

Table B:

ID     | BBBB   | CCCC   |
------ | ------ | ------ |
YYYY   | 3      | 4      |

Binding tables A and B:

ID     | AAAA   | BBBB   | CCCC   |
------ | ------ | ------ | ------ |
XXXX   | 1      | 2      |        |
YYYY   |        | 3      | 4      |

The goal is to insert a large number of small matrices into one large matrix in order to enable continuous query and update / insert.

I found that neither the Matrix nor the slam package has functionality to handle this.

Similar questions have been asked in the past, but no solution seems to have been found:

Message 1: in-r-when-using-named-rows-can-a-sparse-matrix-column-be-added-concatenated

Message 2: bind-together-sparse-model-matrices-by-row-names

Ideas on how to solve it would be much appreciated.

Respectfully,

Frederik

+4

r sparse-matrix

Frederik Andersen Mar 30 17 at 12:15


5 answers

For my purposes (very sparse matrix with millions of rows and tens of thousands of columns, over 99.9% empty) it was still too slow. What worked was the code below - might be useful to others too:

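library(Matrix)

merge.sparse = function(listMatrixes) {
  # takes a list of sparse matrixes with different columns and adds them row wise

  allColnames <- sort(unique(unlist(lapply(listMatrixes, colnames))))
  matrixToReturn <- NULL
  for (currentMatrix in listMatrixes) {
    # position of each of this matrix's columns in the combined column set
    newColLocations <- match(colnames(currentMatrix), allColnames)
    # (i, j, x) triplets of the stored entries; unlike which(currentMatrix > 0),
    # this keeps negative values and stays aligned with the values in x
    indexes <- summary(currentMatrix)
    # nrow() rather than max(rows), so trailing all-zero rows are not dropped
    newMatrix <- sparseMatrix(i = indexes$i, j = newColLocations[indexes$j], x = indexes$x,
                              dims = c(nrow(currentMatrix), length(allColnames)))
    # start from a local NULL instead of exists(), which could find a global variable
    if (is.null(matrixToReturn)) {
      matrixToReturn <- newMatrix
    } else {
      matrixToReturn <- rbind2(matrixToReturn, newMatrix)
    }
  }
  colnames(matrixToReturn) <- allColnames
  matrixToReturn
}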

+3

Valentin 08 Sep 18 at 14:15


We can create an empty sparse matrix that has all rows and columns, and then insert values into it using the subset assignment:

my.bind = function(A, B){
  C = Matrix(0, nrow = NROW(A) + NROW(B), ncol = length(union(colnames(A), colnames(B))), 
             dimnames = list(c(rownames(A), rownames(B)), union(colnames(A), colnames(B))))
  C[rownames(A), colnames(A)] = A
  C[rownames(B), colnames(B)] = B
  return(C)
}

my.bind(A,B)
# 2 x 3 sparse Matrix of class "dgCMatrix"
#      AAAA BBBB CCCC
# XXXX    1    2    .
# YYYY    .    3    4

Note that the above assumes that A and B do not share row names. If there are shared row names, you must use row numbers instead of names for the assignment.

Data:

library(Matrix)
A = Matrix(c(1,2), 1, dimnames = list('XXXX', c('AAAA','BBBB')))
B = Matrix(c(3,4), 1, dimnames = list('YYYY', c('BBBB','CCCC')))
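The note about shared row names can be illustrated with a small variant that assigns by row position. This is only a sketch: the name bind.by.position is made up for illustration, both inputs are assumed to have row names (or both to have none), and sub-assignment into a sparse matrix may be slow for large inputs.

```r
library(Matrix)

# Hypothetical variant of my.bind() that assigns by row position instead of row name
bind.by.position <- function(A, B) {
  allCols <- union(colnames(A), colnames(B))
  C <- Matrix(0, nrow = nrow(A) + nrow(B), ncol = length(allCols),
              dimnames = list(c(rownames(A), rownames(B)), allCols), sparse = TRUE)
  # rows of A go first, rows of B follow; columns are still matched by name
  C[seq_len(nrow(A)), colnames(A)] <- A
  C[nrow(A) + seq_len(nrow(B)), colnames(B)] <- B
  C
}

# Both inputs deliberately use the same row name "XXXX"
A <- Matrix(c(1, 2), 1, dimnames = list("XXXX", c("AAAA", "BBBB")), sparse = TRUE)
B <- Matrix(c(3, 4), 1, dimnames = list("XXXX", c("BBBB", "CCCC")), sparse = TRUE)
C <- bind.by.position(A, B)
```

Because the rows are addressed by number, the second "XXXX" row does not overwrite the first one.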

0

dww Mar 30 17 at 15:10


If you need to concatenate many small sparse matrices into one large sparse matrix, it is much more efficient to map the local row and column indices of each small matrix to global indices and build the large sparse matrix from those. For example:

globalInds <- matrix(NA, nrow=dim(localPairRowColInds)[1], 2)

# extract the corresponding global row indices for the local row indices
globalInds[ , 1] <- globalRowInds[ localPairRowColInds[,1] ] 
globalInds[ , 2] <- globalColInds[ localPairRowColInds[,2] ]

write.table(cbind(globalInds, localPairVals), file=dataFname, append = T, sep = " ", row.names = F, col.names = F)
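The fragment above depends on variables that are not shown. As a rough in-memory sketch of the same idea (mapping local indices to global ones and building the large matrix once), something like the following could work. The blocks list, its contents, and all variable names are made up for illustration, and the triplets are kept in memory rather than appended to a file.

```r
library(Matrix)

# Two small "local" matrices with their own row and column names (made-up data)
blocks <- list(
  sparseMatrix(i = 1, j = 2, x = 5, dims = c(1, 2),
               dimnames = list("r1", c("AAAA", "BBBB"))),
  sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(3, 4), dims = c(2, 2),
               dimnames = list(c("r2", "r3"), c("BBBB", "CCCC")))
)

# Global row and column sets, in order of first appearance
globalRows <- unique(unlist(lapply(blocks, rownames)))
globalCols <- unique(unlist(lapply(blocks, colnames)))

# Translate each block's local (i, j, x) triplets to global indices
triplets <- do.call(rbind, lapply(blocks, function(M) {
  loc <- summary(M)
  data.frame(i = match(rownames(M), globalRows)[loc$i],
             j = match(colnames(M), globalCols)[loc$j],
             x = loc$x)
}))

# Build the large sparse matrix in a single call
big <- sparseMatrix(i = triplets$i, j = triplets$j, x = triplets$x,
                    dims = c(length(globalRows), length(globalCols)),
                    dimnames = list(globalRows, globalCols))
```

Collecting all triplets first avoids repeatedly growing a sparse matrix, which is presumably where the efficiency gain comes from.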

0

Good will Jan 04 at 10:14


Starting from Valentin's answer above, I created my own merge.sparse function to achieve the following:

keep the column and row names (and of course take them into account when concatenating)
keep the original order of row and column names, combining only common ones

The code below seems to do it:

if (length(find.package(package="Matrix",quiet=TRUE))==0) install.packages("Matrix")
require(Matrix)

merge.sparse <- function(...) {

  cnnew <- character()
  rnnew <- character()
  x <- vector()
  i <- numeric()
  j <- numeric()

  for (M in list(...)) {

    cnold <- colnames(M)
    rnold <- rownames(M)

    cnnew <- union(cnnew,cnold)
    rnnew <- union(rnnew,rnold)

    cindnew <- match(cnold,cnnew)
    rindnew <- match(rnold,rnnew)
    ind <- unname(which(M != 0,arr.ind=T))
    i <- c(i,rindnew[ind[,1]])
    j <- c(j,cindnew[ind[,2]])
    x <- c(x,M@x)
  }

  sparseMatrix(i=i,j=j,x=x,dims=c(length(rnnew),length(cnnew)),dimnames=list(rnnew,cnnew))
}

I tested this with the following data:

df1 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y+x,df1,sparse=T)
df2 <- data.frame(x=c("S","S","T","T","U","V","V","W","W","X"),y=c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y+x,df2,sparse=T)
df3 <- data.frame(x=c("A","C","C","B"),y=c("N","M","Z","K"))
M3 <- xtabs(~y+x,df3,sparse=T)
df4 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y+x,df4,sparse=T)
df5 <- data.frame(x=c("K1","K2","K3","K4"),y=c("J1","J2","J3","J4"))
M5 <- xtabs(~y+x,df5,sparse=T)

Which gave:

Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#   N R S T U V W X A B C K1 K2 K3 K4
#M  0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N  1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X  0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z  0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K  0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F  1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G  0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L  0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1 0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2 0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3 0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4 0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
#   [[ suppressing 15 column names ‘N’, ‘R’, ‘S’ ... ]]
#                                
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1

I don't know why the column names are "suppressed" when displaying the merged sparse matrix Ms; converting to a non-sparse matrix brings them back, so ...
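As far as I can tell from the Matrix documentation, the print method for sparse matrices hides column names by default once there are ten or more columns, and printSpMatrix() takes a col.names argument to override that. A minimal sketch on a made-up 2 x 12 matrix:

```r
library(Matrix)

# Made-up 2 x 12 sparse matrix with named rows and columns
M <- sparseMatrix(i = c(1, 2), j = c(1, 12), x = c(1, 2), dims = c(2, 12),
                  dimnames = list(c("r1", "r2"), paste0("col", 1:12)))

# Print with the full column names instead of the "suppressing ..." line
printSpMatrix(M, col.names = TRUE)

# Alternatively, the default can be changed for the session:
# options(sparse.colnames = TRUE)
```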

Also, I noticed that when the same "coordinates" are included multiple times, the sparse matrix contains the sum of the corresponding values in x (see row "Z", column "U", which is 1 in both M1 and M2). Maybe there is a way to change this, but this is fine for my applications.
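One way to change the summing behaviour may be the use.last.ij argument of sparseMatrix(), which keeps only the last value for a repeated (i, j) pair. A small sketch with made-up values:

```r
library(Matrix)

# The same (row, column) pair supplied twice
i <- c(1, 1)
j <- c(1, 1)
x <- c(1, 1)

# Default: values for duplicated pairs are added together
summed <- sparseMatrix(i = i, j = j, x = x, dims = c(1, 1))

# use.last.ij = TRUE: only the last value for a duplicated pair is kept
lastOnly <- sparseMatrix(i = i, j = j, x = x, dims = c(1, 1), use.last.ij = TRUE)
```

In merge.sparse this would mean that a later matrix overwrites an earlier one at shared coordinates instead of being added to it.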

I thought I'd share this code in case anyone else needs to concatenate sparse matrices this way, and if someone can test it on large matrices and suggest performance improvements.

EDIT

After checking this post, I found that retrieving information about the (non-zero) elements of a sparse matrix can be done much more easily with summary, without using which.

So this part of my code above:

ind <- unname(which(M != 0,arr.ind=T))
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,M@x)

can be replaced with:

ind <- summary(M)
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,ind[,3])

I don't know which one is computationally more efficient, or whether there is an even easier way to do this by resizing the matrices and then just summing them, but this seems to work for me, so ...

0

user6376297 June 11 '19 at 15:29


IBrum · Accepted Answer · 2017-03-30T13:52:54+0000

It looks like you need to add empty columns (columns of 0s) to the matrices so that they are compatible with rbind (matrices with the same column names, in the same order). The following code does it:

library(Matrix)

# dummy data
set.seed(3344)
A = Matrix(matrix(rbinom(16, 2, 0.2), 4))
colnames(A)=letters[1:4]
B = Matrix(matrix(rbinom(9, 2, 0.2), 3))
colnames(B) = letters[3:5]

# find which columns are missing from each matrix
misA = colnames(B)[!colnames(B) %in% colnames(A)]
misB = colnames(A)[!colnames(A) %in% colnames(B)]

# named lists of zeros, one entry per missing column
misAl = as.vector(numeric(length(misA)), "list")
names(misAl) = misA
misBl = as.vector(numeric(length(misB)), "list")
names(misBl) = misB

## add the missing columns to the initial matrices;
## list(A) keeps the matrix as a single argument for do.call
An = do.call(cbind, c(list(A), misAl))
Bn = do.call(cbind, c(list(B), misBl))[,colnames(An)]

# final bind
rbind(An, Bn)
