Using sum on a large data frame (3.2 GB) is very slow

I have a large dataframe (2628 x 670316, over 3 GB) and I want to use the sum function on each row.

The data file contains only 0s, 1s and 2s, and looks like this:

0 1 2 0 0 0 0 0 0 1 1 1 ...
0 1 0 0 0 0 2 2 2 2 2 2 ...
.
.
.


When I run sum(data[1,] == 0) it takes a long time. Is there a faster way to do this?

Thanks in advance.

PS. The reason I want to use sum is that I want to get the percentage of 0s, 1s and 2s in each row. If there is another way to do that, such an answer would be helpful as well.



3 answers


If df is your data.frame:

t(apply(df,1,table))*100/ncol(df)


will give you the percentages of 0, 1 and 2 for each row.

(And you avoid explicit loops, which can take a very long time...)

For example:

set.seed(13)
df <- data.frame(matrix(sample(c(0, 1, 2), 500, TRUE), nrow=10))




t(apply(df,1,table))*100/ncol(df)

gives you:

       0  1  2
 [1,] 34 44 22
 [2,] 38 40 22
 [3,] 28 34 38
 [4,] 26 38 36
 [5,] 36 42 22
 [6,] 30 32 38
 [7,] 42 26 32
 [8,] 30 36 34
 [9,] 36 24 40
[10,] 24 34 42


EDIT thanks to @akrun's comment:

In case all possible values (0, 1, 2) are not present in every row, you should do:

t(apply(df, 1, function(x) table(factor(x, levels=0:2))))*100/ncol(df)
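
The reason: table() only creates entries for the values it actually sees, so rows missing a value yield tables of different lengths, and apply() can no longer combine them into a matrix. A quick illustration:

table(c(0, 1, 1, 0))                        # no "2" entry
table(factor(c(0, 1, 1, 0), levels=0:2))    # "2" present with count 0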



If all the data is integer, it is much faster to represent it as a matrix m (this is also semantically closer to what the data actually represents: a rectangular dataset of homogeneous type, not columns of possibly different types), perhaps reading it in with scan() (a sketch follows the timings below). With a matrix, column operations are faster than row operations, so transpose it with t(m). The function tabulate() is much faster than table(), although slightly more subtle to use in this case:

nonZeroCounts <- apply(t(m), 2, tabulate, max(m))
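
The subtlety is that tabulate() only counts positive integers, so the zeros are invisible to it; f1() below recovers the zero counts as ncol(m) - rowSums(n). A quick check:

tabulate(c(0, 1, 2, 2), nbins=2)    # counts of 1 and 2 only: 1 2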


Complete solutions for comparison:

f0 <- function(df)                  # baseline: table() on every row
    t(apply(df, 1, table))

f1 <- function(m) {
    n <- t(apply(t(m), 2, tabulate, max(m)))            # counts of 1s and 2s per row
    ans <- cbind(ncol(m) - as.integer(rowSums(n)), n)   # recover the 0 counts
    colnames(ans) <- 0:max(m)
    ans
}


some data



nrow <- 100; ncol <- floor(nrow * 670316 / 2628)
m <- matrix(sample(0:2, nrow * ncol, TRUE), nrow=nrow)
df <- as.data.frame(m)


and a basic comparison:

> system.time(ans0 <- f0(df))
   user  system elapsed 
  1.082   0.000   1.083 
> system.time(ans1 <- f1(m))
   user  system elapsed 
  0.052   0.000   0.052 
> identical(ans0, ans1)
[1] TRUE


or with nrow=1000

> system.time(ans1 <- f1(m))
   user  system elapsed 
  6.521   1.461   7.984 
> system.time(ans0 <- f0(df))   ## argh, boring, stop after 1.5 minutes!
  C-c C-c
Timing stopped at: 93.608 2.752 96.325 
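
As for reading the data in as a matrix in the first place, here is a minimal scan() sketch; the filename "data.txt" is hypothetical and a whitespace-separated file is assumed:

# hypothetical file of whitespace-separated 0/1/2 values, 2628 rows
m <- matrix(scan("data.txt", what=integer()), nrow=2628, byrow=TRUE)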



Try rowSums, it may be faster:

test <- data.frame(V1=c(1,1,1,1), V2=c(2,2,2,0))
rowSums(test)
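
Since the goal is the percentage of 0s, 1s and 2s per row, rowSums also works on a logical comparison; a minimal sketch, assuming the frame holds only those three values:

rowSums(test == 0) / ncol(test) * 100    # % of zeros in each row
rowSums(test == 1) / ncol(test) * 100    # % of ones
rowSums(test == 2) / ncol(test) * 100    # % of twos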


I doubt, however, that you can get a faster sum than the vanilla sum function.

Another way to get sums is the infamous apply family of functions

apply(test, 1, sum)


Did some tests, and rowSums is pretty fast:

set.seed(13)
df <- data.frame(matrix(sample(c(0, 1, 2), 500000000, TRUE), nrow=2000))

system.time(rowSums(df))
   user  system elapsed 
   8.00    0.68    8.69


While for apply:

system.time(apply(df, 1, sum))
   user  system elapsed 
  81.67    5.99   87.96 







