Dplyr to group the dataframe and then create a regression model for each group
Can anyone suggest an answer dplyr
to the following question?
Divide the data.frame by country and create a linear regression model for each subset
For completeness, the question and answer from the link are below.
Question
For reference, here's Josh's question:
I have a data.frame of data from the World Bank that looks something like this:
country date BirthRate US.
4 Aruba 2011 10.584 25354.8
5 Aruba 2010 10.804 24289.1
6 Aruba 2009 11.060 24639.9
7 Aruba 2008 11.346 27549.3
8 Aruba 2007 11.653 25921.3
9 Aruba 2006 11.977 24015.4
In total, there are 70 country lookups in this dataframe that I would like to run linear regression into. If I use the following, I get a nice lm for a specific country;
andora = subset(high.sub, country == "Andorra")
andora.lm = lm(BirthRate~US., data = andora)
anova(andora.lm)
summary(andora.lm)
But when I try to use the same type of code in a for loop, I get an error, which I will print below the code;
high.sub = subset(highInc, date > 1999 & date < 2012)
high.sub <- na.omit(high.sub)
highnames <- unique(high.sub$country)
for (i in highnames) {
linmod <- lm(BirthRate~US., data = high.sub, subset = (country == "[i]"))
}
#Error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
If I can get this loop running, I would ideally like to add the coefficients and even improve the r-squared values for each model in an empty data.frame file. Any help would be greatly appreciated.
Answer
For reference, here's jlhoward's answer (including BondedDust's comment) using the * apply functions found in this great question: R Group functions: sapply vs. lapply vs. apply. vs. tapply vs. vs. aggregate
models <- sapply(unique(as.character(df$country)),
function(cntry)lm(BirthRate~US.,df,subset=(country==cntry)),
simplify=FALSE,USE.NAMES=TRUE)
# to summarize all the models
lapply(models,summary)
# to run anova on all the models
lapply(models,anova)
#This produces a named list of models, so you could extract the model for Aruba as:
models[["Aruba"]]
source to share
It dplyr
is not yet possible to return the list from the list . If you just need @jazzurro's interception and tilt this is the answer, but if you want the whole model you need to do something like
library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))
Then, if you want to do ANOVA for each installed model, you can do it using rowwise
models %>% rowwise %>% do(anova(.$mod))
but again the result is coerced into the dataframe and is not exactly the same as when executed lapply(models$mod, anova)
.
For now (i.e. until the next version dplyr
), if you need to keep the entire result in a list, you can just use dlply
from plyr
for example plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d)))
, or of course if you don't have to use dplyr
, you can go to @ SvenHohenstein's answer which looks like the best the way to do it anyway.
source to share
Here's one way to use it dplyr
. I created Brazil
using your data. So you have the same results for the two countries. You can see interception and tilt in the final data frame.
DATA AND CODE
structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Aruba", "Brazil"), class = "factor"),
date = c(2011L, 2010L, 2009L, 2008L, 2007L, 2006L, 2011L,
2010L, 2009L, 2008L, 2007L, 2006L), BirthRate = c(10.584,
10.804, 11.06, 11.346, 11.653, 11.977, 10.584, 10.804, 11.06,
11.346, 11.653, 11.977), US. = c(25354.8, 24289.1, 24639.9,
27549.3, 25921.3, 24015.4, 25354.8, 24289.1, 24639.9, 27549.3,
25921.3, 24015.4)), .Names = c("country", "date", "BirthRate",
"US."), class = "data.frame", row.names = c("4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15"))
# country date BirthRate US.
#4 Aruba 2011 10.584 25354.8
#5 Aruba 2010 10.804 24289.1
#6 Aruba 2009 11.060 24639.9
#7 Aruba 2008 11.346 27549.3
#8 Aruba 2007 11.653 25921.3
#9 Aruba 2006 11.977 24015.4
#10 Brazil 2011 10.584 25354.8
#11 Brazil 2010 10.804 24289.1
#12 Brazil 2009 11.060 24639.9
#13 Brazil 2008 11.346 27549.3
#14 Brazil 2007 11.653 25921.3
#15 Brazil 2006 11.977 24015.4
group_by(mydf, country) %>%
do({model = lm(BirthRate ~ US., data = .);
data.frame(int = coef(model)[1], slope = coef(model)[2])})
# country int slope
#1 Aruba 11.02503 8.393295e-06
#2 Brazil 11.02503 8.393295e-06
source to share