Using sum() on a large data frame (3.2 GB) is very slow
I have a large data frame (2628 × 670316, over 3 GB) and I want to apply the sum function to each row.
The data consists of 0s, 1s and 2s:
0 1 2 0 0 0 0 0 0 1 1 1 ...
0 1 0 0 0 0 2 2 2 2 2 2 ...
...
When I run sum(data[1,] == 0) it takes a long time. Is there a faster way to do this?
Thanks in advance.
PS. The reason I want to use sum is that I want to get the percentage of 0s, 1s and 2s in each row. If there is another way to do that, such an answer would be helpful as well.
If df is your data.frame:
t(apply(df,1,table))*100/ncol(df)
will give you percentages 0, 1 and 2 for each row.
(And you avoid explicit loops, which can take a very long time...)
For example:
set.seed(13)
df <- data.frame(matrix(sample(c(0,1,2), 500, TRUE), nrow=10))
t(apply(df,1,table))*100/ncol(df)
gives you:
0 1 2
[1,] 34 44 22
[2,] 38 40 22
[3,] 28 34 38
[4,] 26 38 36
[5,] 36 42 22
[6,] 30 32 38
[7,] 42 26 32
[8,] 30 36 34
[9,] 36 24 40
[10,] 24 34 42
EDIT thanks to @akrun's comment:
In case not all possible values (0, 1, 2) are present in every row, you should do:
t(apply(df, 1, function(x) table(factor(x, levels=0:2))))*100/ncol(df)
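To see why the factor() wrapper matters, here is a small made-up example (the data is mine, not from the answer) where one row contains no 2s; without fixed levels, table() would return vectors of different lengths for different rows and apply() could not bind them into a matrix:

```r
# Two toy rows; the second row contains no 2s
df2 <- data.frame(matrix(c(0, 1, 2, 0,
                           0, 1, 1, 0), nrow = 2, byrow = TRUE))

# Fixing the factor levels guarantees a count (possibly 0) for each of 0, 1, 2
pct <- t(apply(df2, 1, function(x) table(factor(x, levels = 0:2)))) * 100 / ncol(df2)
pct
#       0  1  2
# [1,] 50 25 25
# [2,] 50 50  0
```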
If all the data is integer, then it is much faster to represent it as a matrix m (this is also semantically closer to what the data actually is: a rectangular dataset of homogeneous type, rather than columns of possibly different types), perhaps reading it in with scan(). With a matrix, column operations are faster than row operations, so transpose it with t(m). The function tabulate() is much faster than table(), although slightly more subtle to use in this case:
nonZeroCounts <- apply(t(m), 2, tabulate, max(m))
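The subtlety is that tabulate() only counts the positive integers 1..nbins, so the 0s are silently dropped; that is why f1 below recovers the zero counts by subtraction. A tiny illustration (my own toy vector, not from the answer):

```r
x <- c(0, 1, 2, 2, 0, 0)

tabulate(x, nbins = 2)            # counts of 1 and 2 only; the three 0s are ignored
# [1] 1 2

length(x) - sum(tabulate(x, 2))   # the zero count, recovered by subtraction
# [1] 3
```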
Here are more complete solutions:
f0 <- function(df)
    t(apply(df, 1, table))

f1 <- function(m) {
    n <- t(apply(t(m), 2, tabulate, max(m)))
    ans <- cbind(ncol(m) - as.integer(rowSums(n)), n)
    colnames(ans) <- 0:max(m)
    ans
}
some data:
nrow <- 100; ncol <- floor(nrow * 670316 / 2628)
m <- matrix(sample(0:2, nrow * ncol, TRUE), nrow=nrow)
df <- as.data.frame(m)
and a basic comparison:
> system.time(ans0 <- f0(df))
user system elapsed
1.082 0.000 1.083
> system.time(ans1 <- f1(m))
user system elapsed
0.052 0.000 0.052
> identical(ans0, ans1)
[1] TRUE
or with nrow=1000
> system.time(ans1 <- f1(m))
user system elapsed
6.521 1.461 7.984
> system.time(ans0 <- f0(df)) ## argh, boring, stop after 1.5 minutes!
C-c C-c
Timing stopped at: 93.608 2.752 96.325
Try rowSums, it may be faster:
test<-data.frame(V1=c(1,1,1,1), V2=c(2,2,2,0))
rowSums(test)
I doubt, however, that you can get a faster sum than vanilla sum.
Another way to get row sums is the well-known apply family of functions:
apply(test, 1, sum)
I ran some tests, and rowSums is pretty fast:
set.seed(13)
df<-data.frame(matrix(sample(c(0,1,2),500000000,T),nrow=2000))
system.time(rowSums(df))
user system elapsed
8.00 0.68 8.69
While for apply:
system.time(apply(df, 1, sum))
user system elapsed
81.67 5.99 87.96
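Since the end goal is percentages rather than raw sums, one more vectorized option (my suggestion, not part of the answer above) is rowMeans() on logical comparisons; like rowSums(), it is vectorized and avoids apply() entirely:

```r
set.seed(13)
m <- matrix(sample(0:2, 40, TRUE), nrow = 4)

# Proportion of each value in each row, scaled to percentages
pct <- cbind(`0` = rowMeans(m == 0),
             `1` = rowMeans(m == 1),
             `2` = rowMeans(m == 2)) * 100

rowSums(pct)   # each row's percentages sum to 100
```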