Expand a list column of an R data.frame, keeping the rest of the values in the row

I need to effectively "expand" a list column in an R data.frame. For example, if I have a data.frame defined as:

dbt <- data.frame(values=c(1,1,1,1,2,3,4), 
                  parm1=c("A","B","C","A","B","C","B"),
                  parm2=c("d","d","a","b","c","a","a"))

      

Then an analysis generates one column as a list, similar to the following:

agg <- aggregate(values ~ parm1 + parm2, data=dbt, 
                 FUN=function(x) {return(list(x))})

      

The resulting data.frame looks like this (where class(agg$values) == "list"):

  parm1 parm2 values
1     B     a      4
2     C     a   1, 3
3     A     b      1
4     B     c      2
5     A     d      1
6     B     d      1

      

I would like to expand the "values" column, repeating the values of parm1 and parm2 (adding more rows) for each list item, in an efficient way across all rows of the data.frame.

At the top level, I wrote a function that does the unrolling in a for loop and is called from apply. This is really inefficient (the aggregated data.frame takes about an hour to create and almost 24 hours to unroll; the fully expanded data has ~500k records). The top-level call I'm using is:

unrolled.data <- do.call(rbind, apply(agg, 1, FUN=unroll.data))

      

The function simply calls unlist() on the values column and then, in a for loop, builds the data.frame that it returns.

The environment is somewhat restricted: the tidyr, data.table and splitstackshape libraries are not available to me. This requires not only functions found in base::, but only those available in v3.1.1 and earlier. So the answers to this (not exactly duplicate) question don't apply.

Any suggestions for something faster?

Thanks!

1 answer


With base R, you can try:

with(agg, {
    data.frame(
        lapply(agg[,1:2], rep, times=lengths(values)),
        values=unlist(values)
    )
})
#      parm1 parm2 values
# 1.2      B     a      4
# 1.31     C     a      1
# 1.32     C     a      3
# 2.1      A     b      1
# 3.2      B     c      2
# 4.1      A     d      1
# 4.2      B     d      1
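
One caveat for the stated R 3.1.1 constraint: as far as I know, lengths() was only added to base R in 3.2.0, so the call above may not be available there. A minimal sketch of the same idea that avoids it, assuming vapply(values, length, integer(1)) is an acceptable stand-in; the list column is built by hand here (rather than via aggregate) just to keep the example self-contained:

```r
# Hand-built copy of the aggregated data from the question (an assumption
# made for illustration; in practice `agg` comes from aggregate()).
agg <- data.frame(parm1 = c("B", "C", "A", "B", "A", "B"),
                  parm2 = c("a", "a", "b", "c", "d", "d"),
                  stringsAsFactors = FALSE)
agg$values <- list(4, c(1, 3), 1, 2, 1, 1)

# Number of items in each list element; stands in for lengths(values).
n <- vapply(agg$values, length, integer(1))

# Repeat each parm column n[i] times, then flatten the list column.
expanded <- data.frame(
    lapply(agg[, c("parm1", "parm2")], rep, times = n),
    values = unlist(agg$values)
)
expanded
```

sapply(agg$values, length) should work equally well; vapply is used only because it fixes the return type.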

      



Timings for an alternative (thanks @thelatemail):

library(dplyr)
agg %>%
  sample_n(1e7, replace=TRUE) -> bigger

system.time(
    with(bigger, { data.frame(lapply(bigger[,1:2], rep, times=lengths(values)), values=unlist(values)) })
)
# user  system elapsed 
# 3.78    0.14    3.93 

system.time(
    with(bigger, { data.frame(bigger[rep(rownames(bigger), lengths(values)), 1:2], values=unlist(values)) })
)
# user  system elapsed 
# 11.30    0.34   11.64 
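
Since dplyr is not available in the asker's environment, the comparison can also be set up with base R only. The sketch below is an illustration rather than the benchmark above: it swaps sample_n for sample(), uses a much smaller sample (1e4 rows) so it runs quickly, indexes by row position instead of row names, and replaces lengths() with vapply(). It checks that the two approaches produce the same expanded data:

```r
# Hand-built copy of the aggregated data (assumption, for self-containment).
agg <- data.frame(parm1 = c("B", "C", "A", "B", "A", "B"),
                  parm2 = c("a", "a", "b", "c", "d", "d"),
                  stringsAsFactors = FALSE)
agg$values <- list(4, c(1, 3), 1, 2, 1, 1)

# Base R replacement for sample_n(): resample rows with replacement.
set.seed(1)
bigger <- agg[sample(nrow(agg), 1e4, replace = TRUE), ]
rownames(bigger) <- NULL

n <- vapply(bigger$values, length, integer(1))

# Approach 1: repeat each column separately.
by_column <- data.frame(lapply(bigger[, 1:2], rep, times = n),
                        values = unlist(bigger$values))

# Approach 2: repeat whole rows by index.
by_row <- data.frame(bigger[rep(seq_len(nrow(bigger)), n), 1:2],
                     values = unlist(bigger$values))
rownames(by_row) <- NULL

# Both approaches should agree.
isTRUE(all.equal(by_column, by_row))
```

Wrapping either data.frame() call in system.time() gives a timing on your own machine.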

      
