R: How to avoid 2 'for' loops in R in this function
I know there are many topics on how to avoid loops in R, but I couldn't figure out how to vectorize my iterations. I have a dataset that I call m, and I want this function to build a new matrix of the p-values of the correlation tests between each pair of columns of m.
m<-matrix(rnorm(100),nrow=10,ncol=10)
sig.p <- function(x){
  n <- ncol(x)
  p.values <- matrix(nrow = n, ncol = n)
  for (i in 1:n){
    for (t in 1:n){
      p.values[t, i] <- cor.test(x[, i], x[, t])$p.value
    }
  }
  p.values
}
sig.p(m)
I couldn't figure out whether mapply can be used here. Can anyone suggest how I could vectorize these iterations (with mapply or otherwise)? Thanks in advance!
Cesar
You can use rcorr from the Hmisc package:
library(Hmisc)
rcorr(m)$P
Or use
library(psych)
corr.test(as.data.frame(m))$p
Or use outer from base R:
outer(1:ncol(m),1:ncol(m), FUN= Vectorize(function(x,y)
cor.test(m[,x], m[,y])$p.value))
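As a quick sanity check (a minimal sketch in base R, using a small random matrix mm of my own), the outer approach returns a symmetric matrix whose entries match individual cor.test calls:

```r
# Minimal check of the outer()/Vectorize() approach on a small matrix.
set.seed(1)
mm <- matrix(rnorm(50), nrow = 10, ncol = 5)

p <- outer(1:ncol(mm), 1:ncol(mm),
           FUN = Vectorize(function(x, y) cor.test(mm[, x], mm[, y])$p.value))

all.equal(p, t(p))                                    # TRUE: symmetric
all.equal(p[2, 3], cor.test(mm[, 2], mm[, 3])$p.value) # TRUE: matches the pairwise test
```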
Benchmarks
I tried a smaller dataset (100*100) and a slightly larger one (1e3*1e3). Here are the functions:
akrun <- function() {outer(1:ncol(m1),1:ncol(m1),
FUN= Vectorize(function(x,y) cor.test(m1[,x],
m1[,y])$p.value))}
akrun2 <- function(){rcorr(m1)$P}
agstudy <- function() {M <- expand.grid(seq_len(ncol(m1)),
seq_len(ncol(m1)))
mapply(function(x,y)cor.test(m1[,x], m1[,y])$p.value,M$Var1,M$Var2)}
vpipk <-function(){
n <- ncol(m1)
p.values<-matrix(nrow=n,ncol=n)
for (i in 1:(n-1)){
for (t in (i+1):n){
p.values[t,i]<-cor.test(m1[,i],m1[,t])$p.value
}
}
p.values
}
nrussell <- function(){
sapply(1:ncol(m1), function(z){
sapply(1:ncol(m1), function(x,Y=z){
cor.test(m1[,Y],m1[,x])$p.value
})
})
}
On the 100*100 dataset:
library(microbenchmark)
set.seed(25)
m1 <- matrix(rnorm(1e2*1e2),nrow=1e2,ncol=1e2)
microbenchmark(akrun(), akrun2(), agstudy(), vpipk(),
nrussell(), unit='relative', times=10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun() 257.2310 255.9766 252.2163 254.4946 248.9807 246.5429 10 c
# akrun2() 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 10 a
# agstudy() 255.5920 258.0813 253.5411 256.0581 250.4833 249.0503 10 c
# vpipk() 125.8218 126.3337 125.4592 126.8479 124.9835 124.1383 10 b
#nrussell() 257.9283 256.8480 252.5297 256.0160 250.8853 242.0896 10 c
If I change 1e2 to 1e3 (I did not have time to run microbenchmark, but here is system.time):
system.time(akrun())
# user system elapsed
#403.563 0.751 404.198
system.time(akrun2())
# user system elapsed
# 3.110 0.008 3.117
system.time(agstudy())
# user system elapsed
#445.108 0.877 445.947
system.time(vpipk())
# user system elapsed
#155.597 0.224 155.760
system.time(nrussell())
# user system elapsed
#452.524 1.220 453.713
Not as succinct as @akrun's answer, but here is a base R solution:
sig.p <- function(M){
sapply(1:ncol(M), function(z){
sapply(1:ncol(M), function(x,Y=z){
cor.test(M[,Y],M[,x])$p.value
})
})
}
##
R> sig.p(m)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.00000000 0.08034470 0.244411381 0.03293644 0.3234899 0.80352003 0.5326317 0.03896285 0.702987267 0.57721440
[2,] 0.08034470 0.00000000 0.087168145 0.44828479 0.4824117 0.76469973 0.8222813 0.17662866 0.607145382 0.41460977
[3,] 0.24441138 0.08716815 0.000000000 0.20634394 0.9504582 0.11864029 0.2148186 0.28450468 0.009396629 0.51450066
[4,] 0.03293644 0.44828479 0.206343943 0.00000000 0.8378530 0.78122849 0.0544312 0.22943728 0.524608029 0.66329385
[5,] 0.32348990 0.48241166 0.950458153 0.83785303 0.0000000 0.66105999 0.3157296 0.35715193 0.927945195 0.63163949
[6,] 0.80352003 0.76469973 0.118640294 0.78122849 0.6610600 0.00000000 0.7181462 0.67602651 0.749641726 0.03218081
[7,] 0.53263166 0.82228134 0.214818607 0.05443120 0.3157296 0.71814620 0.0000000 0.39393423 0.266039043 0.38619000
[8,] 0.03896285 0.17662866 0.284504679 0.22943728 0.3571519 0.67602651 0.3939342 0.00000000 0.512083873 0.30980598
[9,] 0.70298727 0.60714538 0.009396629 0.52460803 0.9279452 0.74964173 0.2660390 0.51208387 0.000000000 0.92533524
[10,] 0.57721440 0.41460977 0.514500658 0.66329385 0.6316395 0.03218081 0.3861900 0.30980598 0.925335242 0.00000000
Vectorization isn't always all it's cracked up to be. I'm not sure how big your actual matrix is, but at this size, or even 100 x 100, it's a fairly small one-time cost.
You can more than double your performance by changing your loop structure as follows:
sig.p<-function(x){
n <- ncol(x)
p.values<-matrix(nrow=n,ncol=n)
for (i in 1:(n-1)){
for (t in (i+1):n){
p.values[t,i]<-cor.test(x[,i],x[,t])$p.value
}
}
p.values
}
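If the full symmetric matrix is needed afterwards, the missing upper triangle can be copied over from the lower one; a minimal base R sketch (the fill.sym helper name is my own):

```r
# Hypothetical helper (name is my own): complete the symmetric matrix
# from the lower triangle that the loop above fills in.
fill.sym <- function(p.values) {
  p.values[upper.tri(p.values)] <- t(p.values)[upper.tri(p.values)]
  diag(p.values) <- 0  # a column is perfectly correlated with itself
  p.values
}

# Reproduce the lower-triangle-only loop on a small matrix, then complete it.
set.seed(25)
mm <- matrix(rnorm(40), nrow = 10, ncol = 4)
p <- matrix(nrow = 4, ncol = 4)
for (i in 1:3) {
  for (t in (i + 1):4) {
    p[t, i] <- cor.test(mm[, i], mm[, t])$p.value
  }
}
p.full <- fill.sym(p)
```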
Basically, only calculate the lower triangle, since you know the diagonal will be zero and the matrix is symmetric. mapply or sapply applied across the entire matrix may not perform any better than this.
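Another way to exploit the same symmetry without explicit loops is to run cor.test once per unordered pair of columns via combn, then mirror the results (a sketch of my own, not from the answers above; it assumes a zero diagonal is wanted):

```r
set.seed(25)
m2 <- matrix(rnorm(100), nrow = 10, ncol = 10)
n <- ncol(m2)

# One cor.test per unordered pair of columns.
pairs <- combn(n, 2)
pv <- apply(pairs, 2, function(ij) cor.test(m2[, ij[1]], m2[, ij[2]])$p.value)

# Scatter the p-values into the upper triangle, then mirror them down.
p.values <- matrix(0, nrow = n, ncol = n)
p.values[t(pairs)] <- pv            # rows of t(pairs) are (i, j) with i < j
p.values <- p.values + t(p.values)  # symmetric; diagonal stays 0
```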