Loop to remove duplicates across many trials in R

I have a dataset (called eyeData) that, in a very short version, looks like this:

sNumber runningTrialNo  wordTar                             
1       1               vital       
1       1               raccoon                             
1       1               vital                               
1       1               accumulates                             
1       2               tornado                             
1       2               destroys                                
1       2               tornado                             
1       2               destroys                                
1       2               property                                
4       51              denounces                               
4       51              brings                              
4       51              illegible                               
4       51              frequently                              
4       51              brings                          
4       61              cerebrum
4       61              vital
4       61              knowledge
4       61              vital
4       61              cerebrum


I wrote a loop to remove all duplicates (repeated words) in the wordTar column for each trial separately, so the data would look like this:

   sNumber  runningTrialNo  wordTar                             
1           1               vital       
1           1               raccoon                         
1           1               accumulates                             
1           2               tornado                             
1           2               destroys                                
1           2               property                                
4           51              denounces                               
4           51              brings                              
4           51              illegible                               
4           51              frequently                  
4           61              cerebrum
4           61              vital
4           61              knowledge


Here's the code:

for (sno in eyeData$sNumber) {
  for (trial in eyeData$runningTrialNo) {
    ss <- subset(eyeData, sNumber == sno & runningTrialNo == trial)
    ss.s <- ss[!duplicated(ss$wordTar), ]
  }
}


However, it runs for a very long time, so I had to stop it. Since I am new to the R environment, I assume I am doing something wrong with the loop. Is there a way to improve it so that it gives the desired result?



1 answer


For loops are generally slow in R; usually you want to vectorize your code instead. (Note that your loop also has two other problems: it iterates over every row value of sNumber and runningTrialNo rather than the unique values, and it overwrites ss.s on every pass, so the results are never accumulated anywhere.) There are many ways to vectorize this; here is an example using the dplyr library:

library(dplyr)
eyeData %>% group_by(sNumber, runningTrialNo) %>%
            distinct(wordTar)
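If you would rather avoid extra packages, the same deduplication can be done in base R: duplicated() applied to the identifying columns flags every repeat of a (sNumber, runningTrialNo, wordTar) combination, so negating it keeps only the first occurrence. A minimal sketch, rebuilding a few rows of the sample data for illustration:

```r
# Reconstruct a small slice of the question's data for demonstration
eyeData <- data.frame(
  sNumber        = c(1, 1, 1, 1, 4, 4, 4, 4, 4),
  runningTrialNo = c(1, 1, 1, 1, 61, 61, 61, 61, 61),
  wordTar        = c("vital", "raccoon", "vital", "accumulates",
                     "cerebrum", "vital", "knowledge", "vital", "cerebrum"),
  stringsAsFactors = FALSE
)

# duplicated() on the three columns marks repeated combinations;
# negating it keeps only the first occurrence per subject/trial
deduped <- eyeData[!duplicated(eyeData[, c("sNumber", "runningTrialNo", "wordTar")]), ]
deduped
```

Like the dplyr version, this is a single vectorized pass over the data rather than one subset() call per row.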

This is much faster, as we can see using microbenchmark, which runs each expression 100 times and reports how long it takes:

library(microbenchmark)

microbenchmark(dplyr = eyeData %>% group_by(sNumber, runningTrialNo) %>%
                   distinct(wordTar),
               old = for (sno in eyeData$sNumber) {
                       for(trial in eyeData$runningTrialNo) {
                           ss <- subset(eyeData, sNumber == sno & runningTrialNo == trial)
                           ss.s <- ss[!duplicated(ss$wordTar), ]
                       }
                   })

Unit: milliseconds
  expr        min         lq       mean     median         uq       max neval
 dplyr   1.256438   1.287158   1.567518   1.495092   1.550579  12.29212   100
   old 102.203029 110.265423 112.664063 111.789698 113.166710 304.58312   100
