Smooth out each group with `do`

I have some data, a sample of which is below. My goal is to apply `gam` to each Year and get a new column holding the predicted values from the GAM fit.

fertility <- structure(list(AGE = c(15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 
36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L
), Year = c(1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931), fertility = c(5.170284269, 
14.18135114, 27.69795144, 44.61216712, 59.08896308, 89.66036496, 
105.4563852, 120.1754041, 137.4074262, 148.7159407, 161.5645606, 
157.200515, 143.6340251, 127.8855125, 117.7343628, 159.2909484, 
126.6158821, 109.0681613, 86.98223678, 70.64470361, 111.0070633, 
86.15051988, 68.9204159, 55.92722274, 42.93402958, 56.84376018, 
39.35337243, 26.72142573, 18.46207596, 9.231037978, 4.769704534, 
13.08261815, 25.55198857, 41.15573626, 54.51090896, 81.99522459, 
96.44082973, 109.9015072, 125.6603492, 136.0020892, 148.679958, 
144.6639404, 132.1793638, 117.6867783, 108.345172, 144.2820726, 
114.68575, 98.79142865, 78.7865069, 63.9883456, 100.217918, 77.77726461, 
62.22181169, 50.49147014, 38.76112859, 52.48807067, 36.33789508, 
24.67387938, 17.04740757, 8.523703784)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("AGE", 
"Year", "fertility"))

      

So a non-dplyr, "dumb" way of doing this would be

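library(dplyr)  # for filter(); stats::filter() would not work here
pred <- numeric(nrow(fertility))  # preallocate the vector of predictions
count <- 0
for (i in 1930:1931){
  count <- count + 1
  temp <- filter(fertility, Year == i)
  mod <- mgcv::gam(fertility ~ s(AGE), data=temp)
  pred[length(15:44) * (count - 1) + 1:30] <- predict(mod, newdata = data.frame(AGE = 15:44))
}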

fertility1 <- mutate(fertility, pred = pred)

      

But I need a way to do this in dplyr. I thought I would use `do` to create a model for each Year and then use `predict` to get the values. I can manage the first step, but I am struggling to implement the second part in dplyr:

library(mgcv)
library(dplyr)

  fertility %>%
    #filter(!is.na(fertility)) %>%  # not sure if this is necessary
    group_by(Year) %>%
    dplyr::do(model = mgcv::gam(fertility ~ s(AGE), data = .)) %>%
    left_join(fertility, .) %>%
    mutate(smoothed = predict(model, newdata = AGE))

      

I am getting the error

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "list"

      

which presumably means dplyr doesn't recognise that `model` holds a fitted model rather than just a list item.
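The error can be reproduced outside dplyr, which may help to see what is going on: `do(model = ...)` stores the fitted models in a list column, so `predict()` is handed a list rather than a single `gam` object. The following is only a minimal sketch; the data frame `toy` and its columns `g`, `x` and `y` are made up for illustration and stand in for the fertility data. Note also that `newdata` has to be a data frame (or list) of covariates, not a bare vector.

```r
library(mgcv)

# Hypothetical toy data: two groups, 30 observations each
set.seed(1)
toy <- data.frame(g = rep(c("a", "b"), each = 30), x = rep(1:30, 2))
toy$y <- sin(toy$x / 5) + rnorm(nrow(toy), sd = 0.1)

# One model per group, held in a plain list (much like a list column)
groups <- split(toy, toy$g)
models <- lapply(groups, function(d) gam(y ~ s(x), data = d))

# Calling predict() on the list itself fails, as in the question
res <- tryCatch(predict(models), error = function(e) conditionMessage(e))

# Predicting one model at a time, with a data frame as newdata, works
preds <- Map(function(m, d) as.numeric(predict(m, newdata = d)), models, groups)
```

Here `res` captures the error message instead of stopping the script, and `preds` is a list with one vector of predictions per group.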

+3




4 answers


A smarter way to do this is to use the factor-smooth interactions that have long been available in mgcv, either through `by` terms in `s()` or through the newer basis type `bs = "fs"`. Here's an example with your data:
library("mgcv")
## Make Year a factor
fertility <- transform(fertility, Year = factor(Year))
## Fit model using by terms - include factor as fixed effect too!
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = fertility)
## Plot to see what form this model takes
plot(mod, pages = 1)

      


## Some prediction data
ages <- with(fertility, seq(min(AGE), max(AGE)))
## Need to replicate this once per Year
pdat <- with(fertility,
             data.frame(AGE = rep(ages, nlevels(Year)),
                        Year = rep(levels(Year), each = length(ages))))
## Add the fitted values to the prediction data
pdat <- transform(pdat, fitted = predict(mod, newdata = pdat))
head(pdat)

> head(pdat)
  AGE Year     fitted
1  15 1930 -0.8496705
2  16 1930 15.9568574
3  17 1930 33.0754019
4  18 1930 50.7419122
5  19 1930 68.9116594
6  20 1930 87.1306489
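The replication of the ages once per Year can also be written with `expand.grid()`, which builds every combination of its arguments. This is a sketch under assumptions rather than part of the original answer: it uses a made-up data frame `toy` (columns `g`, `x`, `y`) in place of the fertility data, with `g` playing the role of `Year`.

```r
library(mgcv)

# Hypothetical toy data; g is a factor, as Year is above
set.seed(1)
toy <- data.frame(g = factor(rep(c("a", "b"), each = 30)), x = rep(1:30, 2))
toy$y <- sin(toy$x / 5) + as.integer(toy$g) + rnorm(nrow(toy), sd = 0.1)

# Same model structure as above: factor main effect plus one smooth per level
mod <- gam(y ~ g + s(x, by = g), data = toy)

# expand.grid() crosses the x values with the factor levels; x varies fastest
pdat <- expand.grid(x = seq(min(toy$x), max(toy$x)), g = levels(toy$g))
pdat$fitted <- as.numeric(predict(mod, newdata = pdat))
```

`expand.grid()` converts the character levels to a factor by default, so `pdat$g` arrives at `predict()` as a factor.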

      

However, you can simply extract the fitted values if all you want to do is predict at the observed values of AGE:



fertility <- transform(fertility, fitted = predict(mod))
head(fertility)

> head(fertility)
  AGE Year fertility     fitted
1  15 1930  5.170284 -0.8496705
2  16 1930 14.181351 15.9568574
3  17 1930 27.697951 33.0754019
4  18 1930 44.612167 50.7419122
5  19 1930 59.088963 68.9116594
6  20 1930 89.660365 87.1306489

      

You can also look at the dedicated factor-smooth basis `bs = "fs"`; see `?smooth.terms` and `?factor.smooth.interaction` for details. These smooths are mostly efficient when you have many levels but want the smooth for each level to share the same value of the smoothing parameter.

The main advantage here is that you use all your data and fit a single model, which you can then interrogate in a number of ways that are not easily available if you fit several separate models; for example, you can investigate the differences between the smooths for different years.
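For the `bs = "fs"` basis mentioned above, the factor is passed as a second argument to `s()` instead of through `by`. The following is a rough sketch only, again on a made-up data frame `toy` (columns `g`, `x`, `y`) with three groups; the choice of `method = "REML"` is an assumption for illustration, not something taken from the example above.

```r
library(mgcv)

# Hypothetical toy data: three groups, 30 observations each
set.seed(1)
toy <- data.frame(g = factor(rep(c("a", "b", "c"), each = 30)),
                  x = rep(1:30, 3))
toy$y <- sin(toy$x / 5) + as.integer(toy$g) + rnorm(nrow(toy), sd = 0.1)

# One smooth of x per level of g, all sharing a single smoothing parameter;
# the grouping variable must be a factor
mod_fs <- gam(y ~ s(x, g, bs = "fs"), data = toy, method = "REML")

# Fitted values at the observed data, one per row of toy
toy$fitted <- as.numeric(fitted(mod_fs))
```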

+10




A non-dplyr, "smart" way to do this would be

do.call(rbind,
        lapply(split(fertility, fertility$Year), function(df) {
            df$pred <- predict(gam(fertility ~ s(AGE), data=df))
            df
        }))

      

See `?do.call`, `?lapply` and `?split`.



Or, if you don't like nested function calls:

fertility %>%
   split(fertility$Year) %>%
   lapply(function(df) {
       df$pred <- predict(gam(fertility ~ s(AGE), data=df))
       df
   }) %>%
   do.call(rbind, .)

      

+3




Wrap the result in `data.frame` to keep the source data in the output of `do`, as suggested by @Henrik:

fertility %>%
   group_by(Year) %>%
   do(data.frame(.,pred = predict(gam(fertility ~ s(AGE), data=.))))

      


Alternatively, add `data.table` to the chain:

require(data.table)
fertility %>%
   data.table     %>%
   group_by(Year) %>%
   mutate(pred = predict(gam(fertility ~ s(AGE))))

      

The failure of `mutate` without the `data.table` line could have something to do with the recent changes to scoping in `gam` that @GavinSimpson summarized in chat.
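On current dplyr releases, where `do()` is marked as superseded, the same per-group fit can probably be written with `group_modify()`. This is a minimal sketch and not part of the original answer: it assumes dplyr 0.8.1 or later and uses a made-up data frame `toy` (columns `g`, `x`, `y`) in place of the fertility data.

```r
library(mgcv)
library(dplyr)

# Hypothetical toy data standing in for the fertility table
set.seed(42)
toy <- data.frame(g = rep(c("a", "b"), each = 30), x = rep(1:30, 2))
toy$y <- sin(toy$x / 5) + rnorm(nrow(toy), sd = 0.1)

smoothed <- toy %>%
  group_by(g) %>%
  # .x is the data for one group; fit a GAM to it and attach its fitted values
  group_modify(~ mutate(.x, pred = as.numeric(predict(gam(y ~ s(x), data = .x))))) %>%
  ungroup()
```

The `as.numeric()` call drops the array attributes that `predict()` returns, so `pred` is stored as an ordinary numeric column.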

+3




Same result:

predt=by(fertility[,-2],fertility[,2],function(z){
  mod=mgcv::gam(fertility ~ s(AGE), data=z)
  pred = predict(mod, newdata = data.frame(AGE = z$AGE))
  pred
})
fertility$pred = unlist(predt)

      

+2

