R - Link sparse matrices of different sizes by rows

I am trying to use the Matrix package to link two sparse matrices of different sizes. Linking is done row by row using the column names to match.

Table A:

ID     | AAAA   | BBBB   |
------ | ------ | ------ |
XXXX   | 1      | 2      |

      

Table B:

ID     | BBBB   | CCCC   |
------ | ------ | ------ |
YYYY   | 3      | 4      |

      

Binding table A and B :

ID     | AAAA   | BBBB   | CCCC   |
------ | ------ | ------ | ------ |
XXXX   | 1      | 2      |        |
YYYY   |        | 3      | 4      |

      

The goal is to insert a large number of small matrices into one large matrix in order to enable continuous query and update / insert.

I found that neither Matrix nor slam packages have the functionality to handle this.

Similar questions have been asked in the past, but no solution seems to be found:

Message 1: in-r-when-using-named-rows-can-a-sparse-matrix-column-be-added-concatenated

Message 2: bind-together-sparse-model-matrices-by-row-names

Ideas on how to solve it would be much appreciated.

Respectfully,

Frederick

+4


source to share


5 answers


It looks like empty columns have been added for matrices (columns with 0) to be compatible for rbind

(matrices with the same column names and in the same order). The following code does it:



# dummy data
set.seed(3344)
A = Matrix(matrix(rbinom(16, 2, 0.2), 4))
colnames(A)=letters[1:4]
B = Matrix(matrix(rbinom(9, 2, 0.2), 3))
colnames(B) = letters[3:5]

# finding what missing
misA = colnames(B)[!colnames(B) %in% colnames(A)]
misB = colnames(A)[!colnames(A) %in% colnames(B)]

misAl = as.vector(numeric(length(misA)), "list")
names(misAl) = misA
misBl = as.vector(numeric(length(misB)), "list")
names(misBl) = misB

## adding missing columns to initial matrices
An = do.call(cbind, c(A, misAl))
Bn = do.call(cbind, c(B, misBl))[,colnames(An)]

# final bind
rbind(An, Bn)

      

+3


source


For my purposes (very sparse matrix with millions of rows and tens of thousands of columns, over 99.9% empty) it was still too slow. What worked was the code below - might be useful to others too:



merge.sparse = function(listMatrixes) {
  # takes a list of sparse matrixes with different columns and adds them row wise

  allColnames <- sort(unique(unlist(lapply(listMatrixes,colnames))))
  for (currentMatrix in listMatrixes) {
    newColLocations <- match(colnames(currentMatrix),allColnames)
    indexes <- which(currentMatrix>0, arr.ind = T)
    newColumns <- newColLocations[indexes[,2]]
    rows <- indexes[,1]
    newMatrix <- sparseMatrix(i=rows,j=newColumns, x=currentMatrix@x,
                              dims=c(max(rows),length(allColnames)))
    if (!exists("matrixToReturn")) {
      matrixToReturn <- newMatrix
    }
    else {
      matrixToReturn <- rbind2(matrixToReturn,newMatrix)
    }
  }
  colnames(matrixToReturn) <- allColnames
  matrixToReturn  
}

      

+3


source


We can create an empty sparse matrix that has all rows and columns, and then insert values ​​into it using the subset assignment:

my.bind = function(A, B){
  C = Matrix(0, nrow = NROW(A) + NROW(B), ncol = length(union(colnames(A), colnames(B))), 
             dimnames = list(c(rownames(A), rownames(B)), union(colnames(A), colnames(B))))
  C[rownames(A), colnames(A)] = A
  C[rownames(B), colnames(B)] = B
  return(C)
}

my.bind(A,B)
# 2 x 3 sparse Matrix of class "dgCMatrix"
#      AAAA BBBB CCCC
# XXXX    1    2    .
# YYYY    .    3    4

      

Note that the above assumes that A and B do not separate row names. If there are shared line names, you must use line numbers instead of names for assignment.

Data:

library(Matrix)
A = Matrix(c(1,2), 1, dimnames = list('XXXX', c('AAAA','BBBB')))
B = Matrix(c(3,4), 1, dimnames = list('YYYY', c('BBBB','CCCC')))

      

0


source


If you need to concatenate / concatenate many small sparse matrices into one large sparse matrix, it is much better and more efficient to use global and local row and column index mapping to build a large sparse matrix. For example.

globalInds <- matrix(NA, nrow=dim(localPairRowColInds)[1], 2)

# extract the corresponding global row indices for the local row indices
globalInds[ , 1] <- globalRowInds[ localPairRowColInds[,1] ] 
globalInds[ , 2] <- globalColInds[ localPairRowColInds[,2] ]

write.table(cbind(globalInds, localPairVals), file=dataFname, append = T, sep = " ", row.names = F, col.names = F)

      

0


source


Starting with Valentine's answer above, I created my own merge.sparse function to achieve the following:

  1. keep the column and row names (and of course take them into account when concatenating)
  2. keep the original order of row and column names, combining only common ones

The code below seems to do it:

if (length(find.package(package="Matrix",quiet=TRUE))==0) install.packages("Matrix")
require(Matrix)

merge.sparse <- function(...) {

  cnnew <- character()
  rnnew <- character()
  x <- vector()
  i <- numeric()
  j <- numeric()

  for (M in list(...)) {

  cnold <- colnames(M)
  rnold <- rownames(M)

  cnnew <- union(cnnew,cnold)
  rnnew <- union(rnnew,rnold)

  cindnew <- match(cnold,cnnew)
  rindnew <- match(rnold,rnnew)
  ind <- unname(which(M != 0,arr.ind=T))
  i <- c(i,rindnew[ind[,1]])
  j <- c(j,cindnew[ind[,2]])
  x <- c(x,M@x)
  }

  sparseMatrix(i=i,j=j,x=x,dims=c(length(rnnew),length(cnnew)),dimnames=list(rnnew,cnnew))
}

      

I have verified this with the following details:

df1 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y+x,df1,sparse=T)
df2 <- data.frame(x=c("S","S","T","T","U","V","V","W","W","X"),y=c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y+x,df2,sparse=T)
df3 <- data.frame(x=c("A","C","C","B"),y=c("N","M","Z","K"))
M3 <- xtabs(~y+x,df3,sparse=T)
df4 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y+x,df4,sparse=T)
df5 <- data.frame(x=c("K1","K2","K3","K4"),y=c("J1","J2","J3","J4"))
M5 <- xtabs(~y+x,df5,sparse=T)

      

Which gave:

Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#   N R S T U V W X A B C K1 K2 K3 K4
#M  0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N  1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X  0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z  0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K  0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F  1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G  0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L  0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1 0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2 0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3 0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4 0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
#   [[ suppressing 15 column names ‘N, ‘R, ‘S ... ]]
#                                
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1

      

I don't know why the column names are "suppressed" when trying to display a merged sparse matrix Ms

; converting to a non-sparse matrix returns them, so ...

Also, I noticed that when the same "coordinates" are included multiple times, the sparse matrix contains the sum of the corresponding values ​​in x

(see row "Z", column "U" which is 1 as in M1

and in M2

). Maybe there is a way to change this, but this is fine for my applications.

I thought I'd share this code in case anyone else needs to concatenate sparse matrices this way, and if someone can test it on large matrices and suggest performance improvements.


EDIT

After checking this post, I found that retrieving information about the (non-zero) elements of a sparse matrix can be done much easier summary

without using which

.

So this part of my code above:

ind <- unname(which(M != 0,arr.ind=T))
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,M@x)

      

can be replaced with:

ind <- summary(M)
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,ind[,3])

      

Now I don't know which one is computationally more efficient, or is there an even easier way to do this by resizing the matrices and then just summing them, but it seems to me that it works, so ...

0


source







All Articles