Creating a new framework using weighted averages from data in a list

Question

Creating a new framework using weighted averages from data in a list

I have a lot of data stored in a list and I want to create weighted averages of them and store the results in a new dataframe. For example, with a list:

dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"), 
                      df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")), 
                 .Names = c("df1", "df2"))

In this example, I want to use columns A

, B

and Weight

for weighted averages. I also want to move related data like Site

, and want to sum the quantity TRUE

and FALSE

. My desired output would look something like this:

result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"), 
    A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L, 
    1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))


    Site    A.Weight    B.Weight    Sum.Weight
1   X       4.5         6           2
2   Y       8.0         4           1

The above example is a very simple example, but my real data contains a lot of data in the list and many more columns than simple A

and B

for which I want to calculate weighted averages. I also have a few columns similar to Site

that are constant in each dataframe and that I want to go to the result.

I can manually calculate the weighted averages using something like

weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)

but I'm not sure how I can make it shorter, less "manual". Does anyone have any recommendations? I recently found out how lapply

through dataframes in the list, but my attempts have not been that great so far.

+3

r

Thomas Sep 30 14 at 2:05

source to share

2 answers

through dplyr

 library(dplyr)
 library('devtools')
 install_github('hadley/tidyr')
 library(tidyr)

 unnest(dfs) %>%
           group_by(Site) %>% 
           filter(Weight) %>% 
           mutate(Sum=n()) %>%
           select(-Weight) %>% 
           summarise_each(funs(mean=mean(., na.rm=TRUE)))

gives the result

 #  Site   A B Sum
 #1    X 4.5 6   2
 #2    Y 8.0 4   1

Or using data.table

 library(data.table)
 DT <- rbindlist(dfs)
 DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE), 
                Sum=.N), by = Site, .SDcols = c("A", "B")]
 #   Site   A B Sum
 #1:    X 4.5 6   2
 #2:    Y 8.0 4   1

Update

In response to @jazzuro's comment using dplyr 0.3

, I get

   unnest(dfs) %>% 
             group_by(Site) %>% 
             summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
                    Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
             select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight) 

  #    Site A_weighted.mean B_weighted.mean Sum.Weight
  #1    X             4.5               6          2
  #2    Y             8.0               4          1

+2

akrun Sep 30 14 at 6:02

source to share

Chase · Accepted Answer · 2014-09-30T02:29:59+0000

The trick is to create a function that works for one data.frame and then use it lapply

to iterate over your list. Since it lapply

returns a list, we'll use the do.call

to rbind

result objects together:

foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
  out <- data.frame(others, means, sumWeight)
  return(out)
}

In action:

do.call(rbind, lapply(dfs, foo))
---
    Site   A B sumWeight
df1    X 4.5 6         2
df2    Y 8.0 4         1

Since you said it was a minimal example, here's one approach to expanding this for other columns. We will grepl()

and will use regular expressions to determine the correct columns. Alternatively, you can write them all into a vector. Something like that:

do.call(rbind, lapply(dfs, foo, 
                      meanCols = grepl("A|B", names(dfs[[1]])),
                      otherCols = grepl("Site", names(dfs[[1]]))
                      ))

Creating a new framework using weighted averages from data in a list

Update

More articles: