colMeans in a data frame by a factor variable
I am trying to get the average of some variables within a data frame for each level of a factor. Let's say I have:
time geo var1 var2 var3 var4
1 1990 AT 1 7 13 19
2 1991 AT 2 8 14 20
3 1992 AT 3 9 15 21
4 1990 DE 4 10 16 22
5 1991 DE 5 11 17 23
6 1992 DE 6 12 18 24
I want to get:
time geo var1 var2 var3 var4 m_var2 m_var3
1 1990 AT 1 7 13 19 8 14
2 1991 AT 2 8 14 20 8 14
3 1992 AT 3 9 15 21 8 14
4 1990 DE 4 10 16 22 11 17
5 1991 DE 5 11 17 23 11 17
6 1992 DE 6 12 18 24 11 17
I've tried a few things with by() and lapply(), but I think this goes in the direction of ddply:
library(plyr)
Dataset <- data.frame(time = rep(1990:1992, 2),
                      geo = c(rep("AT", 3), rep("DE", 3)),
                      var1 = as.numeric(1:6), var2 = as.numeric(7:12),
                      var3 = as.numeric(13:18), var4 = as.numeric(19:24))
newvars <- c("var2", "var3")
newData <- Dataset[, c("geo", newvars)]
Currently I can choose between two errors:
ddply(newData,newData[,"geo"],colMeans)
#where R apparently thinks AT is the variable?
ddply(newData,"geo",colMeans)
#where R worries about the factor variable not being numeric?
My lapply() attempt got me fairly far, but it left me with a list that I couldn't get back into the data frame:
lapply(newvars, function(x) {
  by(Dataset[x], Dataset[, "geo"], function(x)
    rep(colMeans(x, na.rm = TRUE), length(unique(Dataset[, "time"]))))
})
I think it should be possible with merge and filters, as in "Pinning in a dataframe on various variables using filters", but I can't put it together. Any help would be appreciated!
One option is to use data.table. We can convert the data.frame to a data.table (setDT(df1)), get the mean (lapply(.SD, mean)) for the selected columns ("var2" and "var3") by specifying the column indices in .SDcols, grouped by "geo". The new columns are created by assigning the output (:=) to the new column names (paste('m', names(df1)[4:5], sep="_")).
library(data.table)
setDT(df1)[, paste('m', names(df1)[4:5], sep = "_") := lapply(.SD, mean),
           by = geo, .SDcols = 4:5]
df1  # := updates by reference and returns invisibly, so print explicitly
# time geo var1 var2 var3 var4 m_var2 m_var3
#1: 1990 AT 1 7 13 19 8 14
#2: 1991 AT 2 8 14 20 8 14
#3: 1992 AT 3 9 15 21 8 14
#4: 1990 DE 4 10 16 22 11 17
#5: 1991 DE 5 11 17 23 11 17
#6: 1992 DE 6 12 18 24 11 17
NOTE. This method is more general. We can create mean columns even for hundreds of variables without any major code changes. For example, if we need the mean of columns 4:100, change .SDcols=4:100 and use names(df1)[4:100] in the paste() call.
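As a minimal sketch of the same idea, the columns can also be selected by name rather than by position, which tends to be less fragile if the column order changes. The vectors cols and new_cols are hypothetical names introduced here for illustration; the data are rebuilt inline so the snippet runs on its own.

```r
library(data.table)

# Same example values as the df1 used in this answer
df1 <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = 1:6, var2 = 7:12, var3 = 13:18, var4 = 19:24
)

# Hypothetical helper vectors: choose the columns by name, not by index
cols     <- c("var2", "var3")
new_cols <- paste("m", cols, sep = "_")

setDT(df1)                               # convert to data.table by reference
df1[, (new_cols) := lapply(.SD, mean),   # one mean per selected column
    by = geo, .SDcols = cols]            # computed within each geo group
```

Wrapping new_cols in parentheses tells data.table to use the contents of the vector as the new column names; extending to more columns then only means changing cols.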
data
df1 <- structure(list(time = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L
), geo = c("AT", "AT", "AT", "DE", "DE", "DE"), var1 = 1:6, var2 = 7:12,
var3 = 13:18, var4 = 19:24), .Names = c("time", "geo", "var1",
"var2", "var3", "var4"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Another simple base R solution is just
transform(df, m_var2 = ave(var2, geo), m_var3 = ave(var3, geo))
# time geo var1 var2 var3 var4 m_var2 m_var3
# 1 1990 AT 1 7 13 19 8 14
# 2 1991 AT 2 8 14 20 8 14
# 3 1992 AT 3 9 15 21 8 14
# 4 1990 DE 4 10 16 22 11 17
# 5 1991 DE 5 11 17 23 11 17
# 6 1992 DE 6 12 18 24 11 17
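This works because ave() applies mean by default. As a rough sketch, a different summary or NA handling can be requested through the FUN argument, which has to be passed by name because unnamed arguments are treated as grouping variables. The small data frame below, including its NA, is made up for illustration and is not the data from the question.

```r
# Made-up data: one missing value in the AT group
df <- data.frame(
  geo  = rep(c("AT", "DE"), each = 3),
  var2 = c(7, NA, 9, 10, 11, 12)
)

# Default FUN = mean: the NA propagates to every row of its group
default_means <- ave(df$var2, df$geo)

# Named FUN: group means that ignore missing values
m_var2 <- ave(df$var2, df$geo, FUN = function(x) mean(x, na.rm = TRUE))
```

With the default, all three AT rows come back as NA; with the named FUN they get the mean of the remaining values.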
A few years later, I think a more concise approach would be to both update the actual dataset (instead of creating a new one) and work with a vector of column names (instead of typing each one out):
vars <- paste0("var", 2:3) # Select desired cols
df[paste0("m_", vars)] <- lapply(df[vars], ave, df[["geo"]]) # Loop and update
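The loop above relies on ave(). For comparison, the merge route mentioned in the question could be sketched in base R roughly as follows: compute one row of means per group with aggregate(), then join it back. The object names means and result are hypothetical, and the data are rebuilt inline from the question's example.

```r
# Example data with the same values as in the question
Dataset <- data.frame(
  time = rep(1990:1992, 2),
  geo  = rep(c("AT", "DE"), each = 3),
  var1 = as.numeric(1:6),
  var2 = as.numeric(7:12),
  var3 = as.numeric(13:18),
  var4 = as.numeric(19:24)
)
newvars <- c("var2", "var3")

# One row per geo holding the mean of each selected column
means <- aggregate(Dataset[newvars], by = Dataset["geo"], FUN = mean)
names(means)[-1] <- paste0("m_", newvars)

# Attach the group means to every original row
result <- merge(Dataset, means, by = "geo")
```

Note that merge() moves the key column geo to the front and sorts by it, so the ave() version is probably the simpler choice when the original row and column order should be kept.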