R: How to avoid 2 'for' loops in R in this function

I know there are many topics on how to avoid loops in R, but I couldn't figure out how to vectorize my iterations. I have a dataset that I assign to m. I want to create a new matrix with this function, consisting of the p-values of the correlation coefficients for each pair of columns of m.

m <- matrix(rnorm(100), nrow = 10, ncol = 10)
sig.p <- function(x){
  n <- ncol(x)
  p.values <- matrix(nrow = n, ncol = n)
  for (i in 1:n){
    for (t in 1:n){
      p.values[t, i] <- cor.test(x[, i], x[, t])$p.value
    }
  }
  p.values
}
sig.p(m)

      

I couldn't figure out how I could use mapply here (if it applies). Can anyone suggest how I could vectorize these iterations (using mapply or otherwise)? Thanks in advance!

Cesar

4 answers


You can use rcorr from the Hmisc package:
 library(Hmisc)
 rcorr(m)$P

      

Or use corr.test from the psych package:

 library(psych)
 corr.test(as.data.frame(m))$p

      

Or use outer from base R:
  outer(1:ncol(m),1:ncol(m), FUN= Vectorize(function(x,y) 
                              cor.test(m[,x], m[,y])$p.value))
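
outer needs the Vectorize wrapper because cor.test is not vectorized over its arguments; without it, the anonymous function would be handed the whole index vectors at once and outer would fail. A self-contained sketch checking that this matches the naive double loop:

```r
# Compare outer() + Vectorize against an explicit double loop;
# both should produce the same matrix of pairwise p-values.
set.seed(1)
m <- matrix(rnorm(100), nrow = 10, ncol = 10)
p.outer <- outer(1:ncol(m), 1:ncol(m),
                 FUN = Vectorize(function(x, y) cor.test(m[, x], m[, y])$p.value))
p.loop <- matrix(NA_real_, ncol(m), ncol(m))
for (i in 1:ncol(m))
  for (t in 1:ncol(m))
    p.loop[t, i] <- cor.test(m[, i], m[, t])$p.value
all.equal(p.outer, p.loop)  # TRUE
```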

      

Benchmarks



I tried a smaller dataset (100*100) and a slightly larger one (1e3*1e3). Here are the functions:

 akrun <- function() {outer(1:ncol(m1),1:ncol(m1), 
            FUN= Vectorize(function(x,y) cor.test(m1[,x],
                              m1[,y])$p.value))}

 akrun2 <- function(){rcorr(m1)$P}
 agstudy <- function() {M <- expand.grid(seq_len(ncol(m1)),
                          seq_len(ncol(m1)))
      mapply(function(x,y)cor.test(m1[,x], m1[,y])$p.value,M$Var1,M$Var2)}
 vpipk <-function(){
        n <- ncol(m1)
        p.values<-matrix(nrow=n,ncol=n)
   for (i in 1:(n-1)){
      for (t in (i+1):n){
          p.values[t,i]<-cor.test(m1[,i],m1[,t])$p.value
     }
   }
   p.values
  }


 nrussell <- function(){
   sapply(1:ncol(m1), function(z){
   sapply(1:ncol(m1), function(x,Y=z){
     cor.test(m1[,Y],m1[,x])$p.value
     })
   })
}

      

On the 100*100 dataset:

 library(microbenchmark)
 set.seed(25)
 m1 <- matrix(rnorm(1e2*1e2),nrow=1e2,ncol=1e2)
 microbenchmark(akrun(), akrun2(), agstudy(), vpipk(),
                    nrussell(), unit='relative', times=10L)
 #Unit: relative
 #  expr      min       lq     mean   median       uq      max neval cld
 #   akrun() 257.2310 255.9766 252.2163 254.4946 248.9807 246.5429    10   c
 #  akrun2()   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000    10   a  
 # agstudy() 255.5920 258.0813 253.5411 256.0581 250.4833 249.0503    10   c
 #   vpipk() 125.8218 126.3337 125.4592 126.8479 124.9835 124.1383    10   b 
 #nrussell() 257.9283 256.8480 252.5297 256.0160 250.8853 242.0896    10   c

      

If I change 1e2 to 1e3 (I did not have time to run microbenchmark, but here is system.time):

system.time(akrun())
 # user  system elapsed 
#403.563   0.751 404.198 

system.time(akrun2())
 #  user  system elapsed 
 # 3.110   0.008   3.117 

system.time(agstudy())
 #  user  system elapsed 
 #445.108   0.877 445.947 

system.time(vpipk())
#  user  system elapsed 
#155.597   0.224 155.760 

system.time(nrussell())
#  user  system elapsed 
#452.524   1.220 453.713 

      



Not as succinct as @akrun's answer, but here is a base R solution:



sig.p <- function(M){
  sapply(1:ncol(M), function(z){
    sapply(1:ncol(M), function(x,Y=z){
      cor.test(M[,Y],M[,x])$p.value
    })
  })
}
##
R> sig.p(m)
            [,1]       [,2]        [,3]       [,4]      [,5]       [,6]      [,7]       [,8]        [,9]      [,10]
 [1,] 0.00000000 0.08034470 0.244411381 0.03293644 0.3234899 0.80352003 0.5326317 0.03896285 0.702987267 0.57721440
 [2,] 0.08034470 0.00000000 0.087168145 0.44828479 0.4824117 0.76469973 0.8222813 0.17662866 0.607145382 0.41460977
 [3,] 0.24441138 0.08716815 0.000000000 0.20634394 0.9504582 0.11864029 0.2148186 0.28450468 0.009396629 0.51450066
 [4,] 0.03293644 0.44828479 0.206343943 0.00000000 0.8378530 0.78122849 0.0544312 0.22943728 0.524608029 0.66329385
 [5,] 0.32348990 0.48241166 0.950458153 0.83785303 0.0000000 0.66105999 0.3157296 0.35715193 0.927945195 0.63163949
 [6,] 0.80352003 0.76469973 0.118640294 0.78122849 0.6610600 0.00000000 0.7181462 0.67602651 0.749641726 0.03218081
 [7,] 0.53263166 0.82228134 0.214818607 0.05443120 0.3157296 0.71814620 0.0000000 0.39393423 0.266039043 0.38619000
 [8,] 0.03896285 0.17662866 0.284504679 0.22943728 0.3571519 0.67602651 0.3939342 0.00000000 0.512083873 0.30980598
 [9,] 0.70298727 0.60714538 0.009396629 0.52460803 0.9279452 0.74964173 0.2660390 0.51208387 0.000000000 0.92533524
[10,] 0.57721440 0.41460977 0.514500658 0.66329385 0.6316395 0.03218081 0.3861900 0.30980598 0.925335242 0.00000000

      



This is a typical use case for mapply:

M <- expand.grid(seq_len(ncol(m)), seq_len(ncol(m)))
mapply(function(x,y)cor.test(m[,x], m[,y])$p.value,M$Var1,M$Var2)
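
Note that mapply returns a flat vector of length ncol(m)^2 rather than a matrix; since expand.grid varies its first column fastest, the vector can be reshaped column-wise. A self-contained sketch:

```r
# Reshape the mapply result into the same square matrix the loops produce.
set.seed(1)
m <- matrix(rnorm(100), nrow = 10, ncol = 10)
M <- expand.grid(Var1 = seq_len(ncol(m)), Var2 = seq_len(ncol(m)))
p.vec <- mapply(function(x, y) cor.test(m[, x], m[, y])$p.value,
                M$Var1, M$Var2)
p.mat <- matrix(p.vec, ncol = ncol(m))  # column j holds tests against column j
```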

      



Vectorization isn't always all it's cracked up to be. Not sure how big your actual matrix is, but at that size, or even 100 x 100, a double loop is a fairly small one-time cost.

You can more than double your performance by changing your loop structure as follows:

sig.p<-function(x){
    n <- ncol(x)
    p.values<-matrix(nrow=n,ncol=n)
    for (i in 1:(n-1)){
        for (t in (i+1):n){
            p.values[t,i]<-cor.test(x[,i],x[,t])$p.value
        }
    }
    p.values
}

      

Basically, only calculate the lower triangle, since you know the diagonal will be zero and the matrix is symmetric. mapply or sapply applied across the entire matrix is unlikely to do better than this.
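
If you later need the full matrix, the lower triangle can be mirrored across the diagonal in base R; a minimal sketch with a small hand-made matrix standing in for the sig.p() output:

```r
# Turn a lower-triangular p-value matrix into a full symmetric one.
p <- matrix(NA_real_, 3, 3)
p[lower.tri(p)] <- c(0.1, 0.2, 0.3)    # values filled by the loop
p[upper.tri(p)] <- t(p)[upper.tri(p)]  # mirror below-diagonal values up
diag(p) <- 0                           # a column is perfectly correlated with itself
p
```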







