Nested performance, how to optimize?

Question

I have a list of data.frames. Inside each data.frame, I want to split by a grouping variable (z), run a function on each piece, string the results together, then put all the results from the nested lapply together in a data.frame, and finally flatten the list of result data.frames into one data.frame.

library(plyr)
df <- data.frame(x = sample(1:200, 30000, replace = TRUE), 
                y = sample(1:200, 30000, replace = TRUE), 
                z = sample(LETTERS, 30000, replace = TRUE))

alist <- list(df,df,df) # longer in real life
answer <- lapply(alist, function(q) {
    a <- split(q,q$z)
    result.1 <- lapply(a, function(w) {
        neww <- cbind(w[,1],w[,2])
        result.2 <- colSums(neww)
    })
    ldply(result.1)
})
# colSums(neww) stands in for a variety of foos; I just use a simple function for easy reproducibility
ldply(answer)

This uses a lot of memory and is also slow. Thanks to @Andrie, I know how to clear the workspace before starting:

 rm(list=setdiff(ls(), "alist"))

But is there a way to change my approach, like junking w in the second lapply, etc., to try to reduce memory usage and make things faster? foo likes a matrix in this case, so data.table won't be my answer. In another foo, I will need everything in w, and the class should be data.frame.

+3

list r lapply

user1320502 Jan 31 '13 at 11:56

2 answers

Why don't you use ddply and llply from plyr as well, rather than only ldply?

# Note: @Flodel has a very nice, simple one-line plyr solution
# Please use that.
out <- ldply(alist, function(q) {
    ddply(q, .(z), function(w) {
        neww <- w[, -3]
        result.2 <- colSums(neww) # dummy function
    })
})

The first ldply passes the items of the list alist one by one. Each time, q is therefore the data.frame contained in one element of the list. Within each data.frame, we want to split by z. Since the input q is a data.frame and the output should also be a data.frame, we use ddply with the second argument .(z) to split by z. This is where you do your calculations and return whatever you want (colSums in this case). The outer ldply then combines the pieces and returns them as a single data.frame.
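To make the shape of the result concrete, here is a minimal sketch of the same ldply/ddply pattern on a tiny data set. The names toy, toy_list and out are made up for illustration, and colSums is again only a stand-in for the real function.

```r
library(plyr)

# Hypothetical toy data: two groups (A, B) with two rows each
toy <- data.frame(x = 1:4,
                  y = c(10, 20, 30, 40),
                  z = c("A", "A", "B", "B"))
toy_list <- list(toy, toy)  # stands in for alist

# Outer ldply walks the list; inner ddply splits each data.frame by z
out <- ldply(toy_list, function(q) {
    ddply(q, .(z), function(w) colSums(w[, c("x", "y")]))
})
out
```

The result should have one row per list element and group, with the grouping column z followed by one column per element of the vector returned by the inner function.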

data.table solution: An alternative, fast solution would be to use data.table on the merged data.frame, which can be done like this (as @Roland also mentioned in the comments):

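library(data.table)
# number of rows in each data.frame, used to build the group index
group <- vapply(alist, nrow, integer(1))
dt <- data.table(do.call(rbind, alist))
# create the group column: one id per original list element
dt[ , grp := rep(seq_along(alist), group)]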
setkey(dt, "grp", "z")
# call your function (here column means)
dt[, lapply(.SD, mean), by="grp,z"]
# or, if it is a correlation
dt[, list(cor_x_y = cor(x,y)), by="grp,z"]
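As a possible variation, and assuming a reasonably recent data.table version, rbindlist() with its idcol argument can merge the list and create the group column in one step. The sketch below uses a made-up toy list rather than the original alist, and sum as a stand-in for the real function.

```r
library(data.table)

# Hypothetical toy data standing in for alist
toy <- data.frame(x = 1:4,
                  y = c(10, 20, 30, 40),
                  z = c("A", "A", "B", "B"))
toy_list <- list(toy, toy)

# Bind all data.frames; idcol records which list element each row came from
dt <- rbindlist(toy_list, idcol = "grp")

# Apply a function to every remaining column, per list element and per z
sums <- dt[, lapply(.SD, sum), by = .(grp, z)]
sums
```

This avoids computing the row counts by hand, at the cost of depending on the idcol argument being available in the installed data.table.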

+6

Arun Jan 31 '13 at 12:12

flodel · Accepted Answer · 2013-01-31T12:22:47+0000

Try something like this:

ldply(alist, ddply, "z", summarize, xy.foo = foo(x, y))

If you want x and y to appear in your final data.frame, replace summarize with transform. Also, looking at your usage of foo, you might need to replace (x, y) with cbind(x, y).
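A small sketch of the difference between summarize and transform in this pattern, using cor(x, y) in place of the unspecified foo. The data and the names small, small_list, per_group and per_row are invented for illustration.

```r
library(plyr)

# Hypothetical data: perfectly correlated in group A, anti-correlated in B
small <- data.frame(x = 1:6,
                    y = c(2, 4, 6, 5, 3, 1),
                    z = rep(c("A", "B"), each = 3))
small_list <- list(small, small)

# summarize: one row per group, keeping only z and the computed value
per_group <- ldply(small_list, ddply, "z", summarize, xy.cor = cor(x, y))

# transform: keeps every original row (x, y, z) and adds the computed column
per_row <- ldply(small_list, ddply, "z", transform, xy.cor = cor(x, y))
```

If the real foo expects a single matrix argument, the call inside summarize or transform would presumably look like foo(cbind(x, y)) instead.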

Also, I would recommend that you profile your code. After all, it might be foo that is slowing you down, not the split/combine part.
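One rough way to check this, sketched here with base R only, is to time the splitting step and the foo step separately with system.time(). The foo below is a placeholder (colSums on a matrix), not the real function, and Rprof() from the utils package would give a more detailed profile.

```r
set.seed(42)
df <- data.frame(x = sample(1:200, 30000, replace = TRUE),
                 y = sample(1:200, 30000, replace = TRUE),
                 z = sample(LETTERS, 30000, replace = TRUE))
alist <- list(df, df, df)

# Placeholder for the real function; assumed to take a numeric matrix
foo <- function(m) colSums(m)

# Time only the splitting step
t_split <- system.time(
    pieces <- lapply(alist, function(q) split(q[, c("x", "y")], q$z))
)

# Time only foo, applied to the pieces that were already split
t_foo <- system.time(
    results <- lapply(pieces, function(p) {
        lapply(p, function(w) foo(as.matrix(w)))
    })
)

# Compare elapsed times of the two steps
rbind(split = t_split, foo = t_foo)[, "elapsed"]
```

If the foo step dominates, restructuring the split/combine code is unlikely to help much, and the effort is better spent on foo itself.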
