Compare information between two matrices in R
I have two matrices, one of which is created from the other by deleting some rows. For example:
m = matrix(1:18, 6, 3)
m1 = m[c(-1, -3, -6),]
Suppose I don't know which rows of m were removed to create m1. How can I find them by comparing the two matrices? The result I want is as follows:
1, 3, 6
The actual matrix I am dealing with is very large. I was wondering if there is an efficient way of doing it.
Here are some approaches:
1) If we can assume that m has no duplicate rows, as in the example in the question, then:
which(tail(!duplicated(rbind(m1, m)), nrow(m)))
## [1] 1 3 6
2) Transpose m and m1, giving tm and tm1, since it is more efficient to work with columns than rows. Define match_indexes(i), which returns a vector r such that each row of m[r, ] matches m1[i, ]. Apply this to each i in 1:n1 and remove the result from 1:n.
n <- nrow(m); n1 <- nrow(m1)
tm <- t(m); tm1 <- t(m1)
# a row of m matches only if all ncol(m) == nrow(tm) entries agree
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == nrow(tm))
setdiff(1:n, unlist(lapply(1:n1, match_indexes)))
## [1] 1 3 6
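To make the comparison step in (2) concrete, here is a small sketch on the example from the question showing what colSums(tm1[, i] == tm) computes for a single i. The name eq is only for illustration.

```r
m   <- matrix(1:18, 6, 3)
m1  <- m[c(-1, -3, -6), ]
tm  <- t(m); tm1 <- t(m1)

# tm1[, 1] is the first row of m1 (which is row 2 of m).
# Comparing it with tm recycles the vector down each column of tm,
# so column j of eq tests row j of m element by element.
eq <- tm1[, 1] == tm

# number of matching entries per row of m
colSums(eq)
## [1] 0 3 0 0 0 0

# a full match needs every entry to agree, i.e. nrow(tm) matches
which(colSums(eq) == nrow(tm))
## [1] 2
```

The count can be at most the number of columns of m, which is why the full-match test compares against nrow(tm).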
3) Calculate the interaction vector for each matrix, then use setdiff and finally match to get the indices:
i <- interaction(as.data.frame(m))
i1 <- interaction(as.data.frame(m1))
match(setdiff(i, i1), i)
## [1] 1 3 6
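One thing worth knowing about (3): by default interaction() keeps a level for every combination of the column values, not just the combinations that actually occur. The sketch below shows this on the example from the question; i_dropped is just an illustrative name, and whether drop = TRUE helps enough on a large matrix is something I have not measured.

```r
m <- matrix(1:18, 6, 3)

# each row becomes one factor value labelled "col1.col2.col3"
i <- interaction(as.data.frame(m))
as.character(i)[1]
## [1] "1.7.13"

# 6 distinct values per column -> 6 * 6 * 6 possible combinations
nlevels(i)
## [1] 216

# drop = TRUE keeps only the combinations that occur in the data
i_dropped <- interaction(as.data.frame(m), drop = TRUE)
nlevels(i_dropped)
## [1] 6
```

With many columns of mostly distinct values the number of possible combinations grows very quickly, so (3) is probably best kept for small inputs.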
Added. If m can contain duplicate rows, then (1) and (3) will only return the first occurrence of any duplicated row in m that is not in m1.
m <- matrix(1:18, 6, 3)
m1 <- m[c(2, 4, 5),]
m <- rbind(m, m[1:2, ])
# 1
which(tail(!duplicated(rbind(m1, m)), nrow(m)))
## [1] 1 3 6
# 2
n <- nrow(m); n1 <- nrow(m1)
tm <- t(m); tm1 <- t(m1)
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == nrow(tm))
setdiff(1:n, unlist(lapply(1:n1, match_indexes)))
## [1] 1 3 6 7
# 3
i <- interaction(as.data.frame(m))
i1 <- interaction(as.data.frame(m1))
match(setdiff(i, i1), i)
## [1] 1 3 6
One possible way is to represent each row as a string:
x1 <- apply(m, 1, paste0, collapse = ';')
x2 <- apply(m1, 1, paste0, collapse = ';')
which(!x1 %in% x2)
# [1] 1 3 6
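As a side note on duplicate rows: because %in% tests every row of m separately, the string approach should report every copy of a row that is missing from m1, not just the first one. A small sketch, assuming an m with two repeated rows appended (key and key1 are illustrative names):

```r
m  <- matrix(1:18, 6, 3)
m1 <- m[c(2, 4, 5), ]
m  <- rbind(m, m[1:2, ])   # rows 7 and 8 repeat rows 1 and 2

# one string key per row
key  <- apply(m, 1, paste0, collapse = ";")
key1 <- apply(m1, 1, paste0, collapse = ";")

# rows of m whose key does not occur anywhere in m1
which(!key %in% key1)
## [1] 1 3 6 7
```

It cannot tell how many copies of a row were removed if at least one copy is still present in m1, since the test only asks whether the key occurs at all.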
Some tests on a large matrix, using my solution and G. Grothendieck's solutions:
set.seed(123)
m <- matrix(rnorm(20000 * 5000), nrow = 20000)
m1 <- m[-sample.int(20000, 1000), ]
system.time({
which(tail(!duplicated(rbind(m1, m)), nrow(m)))
})
# user system elapsed
# 339.888 2.368 342.204
system.time({
x1 <- apply(m, 1, paste0, collapse = ';')
x2 <- apply(m1, 1, paste0, collapse = ';')
which(!x1 %in% x2)
})
# user system elapsed
# 395.428 0.568 395.955
system.time({
n <- nrow(m); n1 <- nrow(m1)
tm <- t(m); tm1 <- t(m1)
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == nrow(tm))
setdiff(1:n, unlist(lapply(1:n1, match_indexes)))
})
# > 15 min, did not finish
system.time({
i <- interaction(as.data.frame(m))
i1 <- interaction(as.data.frame(m1))
match(setdiff(i, i1), i)
})
# ran out of memory; my machine with 32 GB of RAM crashed.
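Since the large runs take minutes, it may be worth checking on a smaller matrix first that the approaches that do finish return the rows that were actually removed. A minimal sketch, with sizes chosen arbitrarily and removed, r1, r2 as illustrative names:

```r
set.seed(123)
m <- matrix(rnorm(200 * 50), nrow = 200)
removed <- sort(sample.int(200, 10))   # the ground truth we try to recover
m1 <- m[-removed, ]

# duplicated() approach
r1 <- which(tail(!duplicated(rbind(m1, m)), nrow(m)))

# string approach
x1 <- apply(m, 1, paste0, collapse = ";")
x2 <- apply(m1, 1, paste0, collapse = ";")
r2 <- which(!x1 %in% x2)

# both should recover exactly the removed row indices
stopifnot(identical(as.integer(r1), removed),
          identical(as.integer(r2), removed))
```

This only checks correctness on random data without duplicate rows; it says nothing about how the timings scale.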