Nested lapply performance, how to optimize?
I have a list of data.frames. Inside each data.frame, I want to split by a grouping (z), run a function, string the results together, then put all the results together in a data.frame, then flatten the list of result data.frames into one data.frame.
library(plyr)
df <- data.frame(x = sample(1:200, 30000, replace = TRUE),
                 y = sample(1:200, 30000, replace = TRUE),
                 z = sample(LETTERS, 30000, replace = TRUE))
alist <- list(df, df, df) # longer in real life
answer <- lapply(alist, function(q) {
  a <- split(q, q$z)
  result.1 <- lapply(a, function(w) {
    neww <- cbind(w[, 1], w[, 2])
    result.2 <- colSums(neww)
  })
  ldply(result.1)
})
# foo can actually be a variety of functions (e.g. cor(neww)); a simple one is used here for easy reproducibility
ldply(answer)
This uses a lot of memory and is also slow. Thanks to @Andrie, I know how to clear the workspace before starting:
rm(list=setdiff(ls(), "alist"))
But is there a way to change my approach, like junking w in the second lapply, etc., to try and reduce memory usage and make things faster? foo likes a matrix in this case, so data.table won't be my answer. In another foo, I will need everything in w, and the class should be data.frame.
Try something like this:
ldply(alist, ddply, "z", summarize, xy.foo = foo(x, y))
If you want x and y to appear in your final data.frame, replace summarize with transform. Also, looking at your usage of foo, you might need to replace (x, y) with cbind(x, y).
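To make the summarize/transform difference concrete, here is a small sketch on toy data. The toy data.frame and the result names res_sum and res_tr are made up for illustration, and cor() is only a stand-in for your foo:

```r
library(plyr)

# Toy data (illustrative only): two groups in z, three rows each.
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 y = c(2, 4, 6, 6, 5, 4),
                 z = c("a", "a", "a", "b", "b", "b"))
alist <- list(df, df)

# summarize: one row per group per list element, only z and the new column.
res_sum <- ldply(alist, ddply, "z", summarize, xy.cor = cor(x, y))

# transform: every original row is kept, with the group value added as a column.
res_tr <- ldply(alist, ddply, "z", transform, xy.cor = cor(x, y))

print(res_sum)
print(head(res_tr))
```

With summarize you get back one row per group, so x and y are gone; with transform the per-group result is repeated on each row of that group.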
Also, I would recommend that you profile your code. After all, foo might be what is slowing you down, not the split/combine part.
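One rough way to check this, as a sketch only: time the split step and the foo step separately with system.time(). Here cor() stands in for your real foo, and the names pieces, t_split and t_foo are just illustrative. Rprof() from utils would give a finer-grained breakdown if you need one.

```r
# Same shape of data as in the question.
df <- data.frame(x = sample(1:200, 30000, replace = TRUE),
                 y = sample(1:200, 30000, replace = TRUE),
                 z = sample(LETTERS, 30000, replace = TRUE))

# Time the split step on its own.
t_split <- system.time(pieces <- split(df, df$z))

# Time applying foo to the already-split pieces (cor() is a stand-in for foo).
foo <- function(w) cor(w$x, w$y)
t_foo <- system.time(res <- vapply(pieces, foo, numeric(1)))

# Elapsed seconds for each step, side by side.
print(c(split = t_split[["elapsed"]], foo = t_foo[["elapsed"]]))
```

If the foo timing dominates, rearranging the split/combine code will not buy you much.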
Why do you use only ldply, and not ddply and llply from plyr as well?
# Note: @Flodel has a very nice, simple one-line plyr solution.
# Please use that.
out <- ldply(alist, function(q) {
  ddply(q, .(z), function(w) {
    neww <- w[, -3]
    result.2 <- colSums(neww) # dummy function
  })
})
The first ldply passes the elements of alist one by one. Each time, q is therefore the data.frame contained in that list element. Within it, we want to split by z. Since the input q is a data.frame and the output should also be a data.frame, we use ddply with the second argument .(z) to split by z. This is where you do your calculations and return whatever you want (colSums in this case). ldply then combines the pieces and returns a data.frame.
data.table solution: An alternative, fast solution would be to use data.table on the merged data.frame, which can be done like this (as @Roland also mentioned in his comments):
library(data.table)
# number of rows in each list element, used to build the group id
group <- vapply(alist, nrow, integer(1))
dt <- data.table(do.call(rbind, alist))
# group id: which element of alist each row came from
dt[, grp := rep(seq_along(alist), group)]
setkey(dt, grp, z)
# call your function (here column means)
dt[, lapply(.SD, mean), by = "grp,z"]
# or, if it's correlation
dt[, list(cor_x_y = cor(x, y)), by = "grp,z"]