Smoothing each group with `do`
I have some data, a sample of which is below. My goal is to fit a `gam` to each Year separately and add a new column
holding the predicted values from that year's GAM.
fertility <- structure(list(AGE = c(15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L,
36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L
), Year = c(1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930,
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930,
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1931,
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931,
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931,
1931, 1931, 1931, 1931, 1931, 1931, 1931), fertility = c(5.170284269,
14.18135114, 27.69795144, 44.61216712, 59.08896308, 89.66036496,
105.4563852, 120.1754041, 137.4074262, 148.7159407, 161.5645606,
157.200515, 143.6340251, 127.8855125, 117.7343628, 159.2909484,
126.6158821, 109.0681613, 86.98223678, 70.64470361, 111.0070633,
86.15051988, 68.9204159, 55.92722274, 42.93402958, 56.84376018,
39.35337243, 26.72142573, 18.46207596, 9.231037978, 4.769704534,
13.08261815, 25.55198857, 41.15573626, 54.51090896, 81.99522459,
96.44082973, 109.9015072, 125.6603492, 136.0020892, 148.679958,
144.6639404, 132.1793638, 117.6867783, 108.345172, 144.2820726,
114.68575, 98.79142865, 78.7865069, 63.9883456, 100.217918, 77.77726461,
62.22181169, 50.49147014, 38.76112859, 52.48807067, 36.33789508,
24.67387938, 17.04740757, 8.523703784)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("AGE",
"Year", "fertility"))
So a non-dplyr, "dumb" way of doing this would be
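# pre-allocate the vector of predictions (one value per row)
pred <- numeric(nrow(fertility))
count <- 0
for (i in 1930:1931){
  count <- count + 1
  temp <- filter(fertility, Year == i)
  mod <- mgcv::gam(fertility ~ s(AGE), data=temp)
  # fill the 30-row block belonging to this year
  pred[length(15:44) * (count - 1) + 1:30] <- predict(mod, newdata = data.frame(AGE = 15:44))
}
fertility1 <- mutate(fertility, pred = pred)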
But I need a method in `dplyr`. I thought I could use `do` to create a model for each Year and then use `predict` to get the values. I can manage the first step, but I am struggling to implement the second part in `dplyr`:
library(mgcv)
library(dplyr)
fertility %>%
#filter(!is.na(fertility)) %>% # not sure if this is necessary
group_by(Year) %>%
dplyr::do(model = mgcv::gam(fertility ~ s(AGE), data = .)) %>%
left_join(fertility, .) %>%
mutate(smoothed = predict(model, newdata = AGE))
I am getting the error
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
which presumably means `dplyr` doesn't recognise that the `model` column holds fitted models rather than plain list items.
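To make the error above concrete, here is a minimal sketch of what `do(model = ...)` returns. It uses a small synthetic data frame called `toy` (a hypothetical stand-in with made-up values, not the real series) so that it runs on its own. It suggests two things to check: `model` is a list-column, so `predict()` has to be applied to one element at a time, and `newdata` needs to be a data frame rather than a bare vector.

```r
library(mgcv)
library(dplyr)

# Toy stand-in for the question's data (hypothetical values)
toy <- data.frame(AGE  = rep(15:44, times = 2),
                  Year = rep(c(1930, 1931), each = 30))
toy$fertility <- 150 * exp(-(toy$AGE - 26)^2 / 60) + 3 * sin(toy$AGE)

# One row per Year; `model` is a list-column holding the fitted gam objects
models <- toy %>%
  group_by(Year) %>%
  do(model = gam(fertility ~ s(AGE), data = .))

class(models$model)       # the column as a whole is a list
class(models$model[[1]])  # a single element is the gam object

# predict() works once a single model is extracted and newdata is a data frame
p <- predict(models$model[[1]], newdata = data.frame(AGE = 15:44))
```

Calling `predict()` on the whole column dispatches on the list, not on the models inside it, which is consistent with the error message in the question.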
This can be done in a single model, via `by` terms in `s()` or through the newer basis type `bs = "fs"`. Here's an example with your data:
library("mgcv")
## Make Year a factor
fertility <- transform(fertility, Year = factor(Year))
## Fit model using by terms - include factor as fixed effect too!
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = fertility)
## Plot to see what form this model takes
plot(mod, pages = 1)
## Some prediction data
ages <- with(fertility, seq(min(AGE), max(AGE)))
## Need to replicate this once per Year
pdat <- with(fertility,
data.frame(AGE = rep(ages, nlevels(Year)),
Year = rep(levels(Year), each = length(ages))))
## Add the fitted values to the prediction data
pdat <- transform(pdat, fitted = predict(mod, newdata = pdat))
head(pdat)
> head(pdat)
AGE Year fitted
1 15 1930 -0.8496705
2 16 1930 15.9568574
3 17 1930 33.0754019
4 18 1930 50.7419122
5 19 1930 68.9116594
6 20 1930 87.1306489
However, you can simply extract the fitted values if all you want is predictions at the observed values of `AGE`:
fertility <- transform(fertility, fitted = predict(mod))
head(fertility)
> head(fertility)
AGE Year fertility fitted
1 15 1930 5.170284 -0.8496705
2 16 1930 14.181351 15.9568574
3 17 1930 27.697951 33.0754019
4 18 1930 44.612167 50.7419122
5 19 1930 59.088963 68.9116594
6 20 1930 89.660365 87.1306489
You could also look at the factor-smooth interaction basis, `bs = "fs"`; see `?smooth.terms` and `?factor.smooth.interaction` for details. These smooths are most efficient when the factor has many levels and you want the smooth for each level to share the same value of the smoothing parameter.
The main advantage here is that you use all your data to fit one model, which you can then interrogate in ways that are not easy when you fit several separate models. For example, you can investigate the differences between the smooths for different years.
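As a rough illustration of the `bs = "fs"` option mentioned in this answer, the sketch below fits a factor-smooth interaction to a synthetic data frame. The `toy` data, its three years, and the name `mod_fs` are hypothetical and chosen only so the example runs on its own. The choice of `method = "REML"` is an assumption, not something the answer prescribes.

```r
library(mgcv)

# Toy data: three hypothetical years, Year stored as a factor
toy <- data.frame(AGE  = rep(15:44, times = 3),
                  Year = factor(rep(c(1930, 1931, 1932), each = 30)))
toy$fertility <- 150 * exp(-(toy$AGE - 26)^2 / 60) *
  rep(c(1, 0.92, 0.85), each = 30) + 3 * sin(toy$AGE)

# One smooth of AGE per level of Year; see ?factor.smooth.interaction
# for how smoothing parameters are shared across levels
mod_fs <- gam(fertility ~ s(AGE, Year, bs = "fs"),
              data = toy, method = "REML")

# Fitted values at the observed ages, one per row of toy
toy$fitted <- as.numeric(predict(mod_fs))
head(toy)
```

With only two or three levels the `by`-term model shown earlier is probably the more natural choice; the `fs` basis is aimed at factors with many levels.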
A non-dplyr, smart way to do this would be
do.call(rbind,
lapply(split(fertility, fertility$Year), function(df) {
df$pred <- predict(gam(fertility ~ s(AGE), data=df))
df
}))
See `?do.call`, `?lapply` and `?split`.
Or, if you don't like nested function calls:
fertility %>%
split(fertility$Year) %>%
lapply(function(df) {
df$pred <- predict(gam(fertility ~ s(AGE), data=df))
df
}) %>%
do.call(rbind, .)
Keep the source data.frame in the result of `do`, as suggested by @Henrik:
df %>%
group_by(Year) %>%
do(data.frame(.,pred = predict(gam(fertility ~ s(AGE), data=.))))
Or add `data.table` to the chain:
require(data.table)
df %>%
data.table %>%
group_by(Year) %>%
mutate(pred = predict(gam(fertility ~ s(AGE))))
The failure of `mutate` without the `data.table` line may have something to do with the recent `gam` scoping changes that @GavinSimpson summarized in chat.
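For readers on a more recent dplyr, the same per-group idea can be sketched with `group_modify()`, which I believe is available from dplyr 0.8.1 onward. This is an alternative to the `do()` call above, not what the answer itself used. The `toy` data frame is a hypothetical stand-in with made-up values. Passing `data = .x` explicitly means `gam()` does not have to look up the variables in the calling scope.

```r
library(mgcv)
library(dplyr)

# Toy stand-in for the question's data (hypothetical values)
toy <- data.frame(AGE  = rep(15:44, times = 2),
                  Year = rep(c(1930, 1931), each = 30))
toy$fertility <- 150 * exp(-(toy$AGE - 26)^2 / 60) + 3 * sin(toy$AGE)

# Fit one gam per Year and attach its fitted values to that group's rows
smoothed <- toy %>%
  group_by(Year) %>%
  group_modify(~ mutate(.x,
                        pred = as.numeric(predict(gam(fertility ~ s(AGE),
                                                      data = .x))))) %>%
  ungroup()

head(smoothed)
```

The result keeps every original row and adds a `pred` column, much like the `do(data.frame(., pred = ...))` version.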