Summing data frame sections in R
For sample data:
structure(list(id = 1:10, group.id = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
x = c(2.12, 1.23, 2.36, 4.21, 2.36, NA, 2.36, 4.36, 1.23,
2.23), y = c(6.56, 2.36, NA, 4.36, 1.23, 8.56, 4.23, 5.36,
2.36, 1.23), z = c(4.36, NA, 5.23, 5.36, 1.23, 4.23, 1.23,
NA, 3.26, 2.23), group.x = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), group.y = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), group.z = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA)), .Names = c("id", "group.id", "x", "y", "z", "group.x",
"group.y", "group.z"), class = "data.frame", row.names = c(NA,
-10L))
I want to fill group.x, group.y and group.z with the means of the x, y and z columns, computed by group.id.
So, for group a, the values in the rows with IDs 1, 2, 3 and 10 are averaged and the result is filled into the corresponding columns "group.x", "group.y" and "group.z". The same is then done for groups b and c.
Ideally, I would also like an additional table describing the groups, with the number of values and the means, so I can judge how representative the means are. With my basic knowledge of R, I would just subset the data frame and compute the mean and count for each section, but there must be a better way. Any ideas?
We could use data.table to create new columns holding the mean of 'x', 'y' and 'z', grouped by the 'group.id' column. We convert the "data.frame" to a "data.table" by reference with setDT(df1). Alternatively, we can use as.data.table, as suggested by @Ricardo Saporta; one advantage is that the original dataset remains unchanged. I prefer setDT, but that is only a subjective preference. We don't need to create the NA columns in the original dataset beforehand.
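library(data.table)
# df1: the question's data frame, here assumed without the pre-created NA columns
# Add group.x, group.y, group.z by reference: per-group means, ignoring NAs
setDT(df1)[, paste('group', c('x', 'y', 'z'), sep=".") :=
lapply(.SD, mean, na.rm=TRUE), by=group.id, .SDcols=c('x','y','z')]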
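The as.data.table alternative mentioned above can be sketched as follows. This is only an illustration: the data frame is rebuilt by hand from the question's values without the pre-created NA columns, and the names dt and cols are made up for the example.

```r
library(data.table)

# Hand-built copy of the question's data, without the pre-created NA columns
df1 <- data.frame(
  id = 1:10,
  group.id = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "a")),
  x = c(2.12, 1.23, 2.36, 4.21, 2.36, NA, 2.36, 4.36, 1.23, 2.23),
  y = c(6.56, 2.36, NA, 4.36, 1.23, 8.56, 4.23, 5.36, 2.36, 1.23),
  z = c(4.36, NA, 5.23, 5.36, 1.23, 4.23, 1.23, NA, 3.26, 2.23)
)

# as.data.table() makes a copy, so df1 itself is left untouched
dt <- as.data.table(df1)

# Hypothetical helper vector naming the columns to average
cols <- c("x", "y", "z")

# Add group.x, group.y, group.z to the copy: per-group means, ignoring NAs
dt[, paste("group", cols, sep = ".") := lapply(.SD, mean, na.rm = TRUE),
   by = group.id, .SDcols = cols]
```

After this, dt carries the three new columns while df1 still has only its original five, which is the trade-off against setDT: an extra copy in memory in exchange for an unmodified input.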
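If the NA columns already exist, first make sure their class matches the values to be assigned. The NA columns in the sample data are logical, so convert them to "numeric" before filling in the means: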
# Columns 6:8 are the existing group.x, group.y, group.z; convert them to numeric first
setDT(df1)[, 6:8 := lapply(.SD, as.numeric), .SDcols=6:8][,
paste('group', c('x', 'y', 'z'), sep=".") :=
lapply(.SD, mean, na.rm=TRUE), by=group.id, .SDcols=c('x','y','z')]
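The question also asks for a separate table with the number of values and the mean per group. One possible sketch, not part of the original answer, is to reshape to long format with data.table's melt and aggregate. The data frame is again rebuilt by hand from the question's values, and the names long, summary_dt, n and mean are chosen only for this illustration.

```r
library(data.table)

# Hand-built copy of the question's data (only the columns needed here)
df1 <- data.frame(
  id = 1:10,
  group.id = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "a")),
  x = c(2.12, 1.23, 2.36, 4.21, 2.36, NA, 2.36, 4.36, 1.23, 2.23),
  y = c(6.56, 2.36, NA, 4.36, 1.23, 8.56, 4.23, 5.36, 2.36, 1.23),
  z = c(4.36, NA, 5.23, 5.36, 1.23, 4.23, 1.23, NA, 3.26, 2.23)
)

# Long format: one row per (id, variable) pair, values in column 'value'
long <- melt(as.data.table(df1),
             id.vars = c("id", "group.id"),
             measure.vars = c("x", "y", "z"))

# One row per group and variable: count of non-missing values and their mean
summary_dt <- long[, .(n = sum(!is.na(value)),
                       mean = mean(value, na.rm = TRUE)),
                   by = .(group.id, variable)]

print(summary_dt)
```

The n column shows how many non-missing values each mean is based on, which is what the question wants for judging how representative the group means are.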