Generalised additive models (hereafter: GAMs) provide a tool for modelling non-linear relationships between a response variable and its predictors. Generalised additive models are “generalized linear models with a linear predictor involving a sum of smooth functions of covariates” (S.N. Wood, 2017). In other words, they estimate functional relationships between predictors and the response variable by approximating continuous predictors with smooth curves, or “smoothers”. Smoothers are functions which are, themselves, sums of functions. The functions that, summed together, produce a smoother are called its basis functions.
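In symbols (a generic formulation, with notation loosely following Wood, not quoted verbatim from the book):

$$g\big(\mathbb{E}[y]\big) = \beta_0 + f_1(x_1) + \dots + f_p(x_p), \qquad f_j(x_j) = \sum_{k=1}^{K_j} \beta_{jk}\, b_{jk}(x_j)$$

where $g$ is a link function, each smoother $f_j$ is a weighted sum of its basis functions $b_{jk}$, and the coefficients $\beta_{jk}$ are estimated from the data.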
The practical outcome is that GAMs can produce “wiggly” model fits, much as polynomials do, but without the overfitting (i.e. the excessive wiggliness) that tends to characterise high-degree polynomial models (more details and references here).
Here, Mitchell Lyons shows how to use a GAM to model a non-linear relationship between a response variable y and a single continuous predictor x. Below, I build on his example by adding a categorical predictor to the data and to the model, and by exploring the interaction between the two predictors.
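All the models below are fitted to a data frame dd with a continuous predictor x, a two-level factor group, and a response y. The original post simulates its own data; purely so the code here is self-contained, this is a minimal sketch of how data resembling panel (c) could be generated (the seed, the functional form, and all parameter values are my assumptions, not the original ones):

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
group <- factor(rep(c("A", "B"), each = n / 2))
# group A: fluctuations of constant amplitude; group B: amplitude grows with x
amplitude <- ifelse(group == "A", 1, 0.3 * x)
y <- ifelse(group == "A", 0, 2) + amplitude * sin(2 * x) + rnorm(n, sd = 0.3)
dd <- data.frame(x = x, group = group, y = y)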
(a) The two groups have the same correlation with x
In this case, the variable group does not improve the explanatory power of the model and can be dropped:
mgcv::gam(y ~ s(x), data=dd, method = "REML")
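To check this formally rather than by eye, one option (a sketch; the model names m_no_group and m_with_group are mine) is to fit the model with and without group and compare AIC values:

m_no_group <- mgcv::gam(y ~ s(x), data = dd, method = "REML")
m_with_group <- mgcv::gam(y ~ group + s(x), data = dd, method = "REML")
AIC(m_no_group, m_with_group)  # similar AICs support dropping group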
(b) An “additive” GAM
In panel (b), the mean of y differs between the two groups, but the relationship between x and y is the same in both. This situation can be described by a GAM in which x and group do not interact:
mgcv::gam(y ~ group + s(x), data=dd, method = "REML")
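In this formulation the group effect enters as a parametric term, i.e. a constant vertical offset between two otherwise identical curves. The offset can be read off the parametric part of the model summary (m_b is a name of my choosing):

m_b <- mgcv::gam(y ~ group + s(x), data = dd, method = "REML")
summary(m_b)$p.table  # parametric terms: intercept (group A) and the group B offset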
(c) An “interactive” GAM
In panel (c), the two groups differ not only in their mean y, but also in the shape of the relationship between y and x: in group A, the amplitude of the fluctuations in y is constant; in group B, the amplitude increases as x increases.
In GAMs, interactions between continuous and categorical predictors can be modelled in different ways – see here and here. Here are three possible models:
(c.1) An interaction-only GAM
We can fit a model that assumes a pure interactive effect of group and x on y, and that neither predictor has a significant effect on y on its own:
mgcv::gam(y ~ s(x, by=group), data=dd, method = "REML")
This model clearly does not capture the observed variability well. This is expected: in mgcv, smooths with a factor by variable are subject to centring (identifiability) constraints, so they cannot absorb the difference in mean y between the groups, and the factor should normally also enter as a main effect.
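A quick visual check of the lack of fit (a sketch):

m_c1 <- mgcv::gam(y ~ s(x, by = group), data = dd, method = "REML")
plot(dd$x, dd$y, col = as.integer(dd$group))  # observed data, coloured by group
points(dd$x, fitted(m_c1), pch = 16, col = as.integer(dd$group))  # fitted values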
(c.2) A GAM accounting for a group main effect and an interaction between x and group
mgcv::gam(y ~ group + s(x, by=group), data=dd, method = "REML")
(c.3) A GAM accounting for an effect of group, x, and their interaction
mgcv::gam(y ~ group + s(x) + s(x, by=group, m=1), data=dd, method = "REML")
For this model formulation, Gavin Simpson et al. suggest specifying m=1, so that the penalty matrix is computed on the first derivatives of the by-group smooths (see here and here). I take their word for it. The order of the derivatives on which the penalty matrix is computed can matter in some situations, but in this case it does not make much of a difference in how well the models fit the data.
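To see how little it matters here, one can refit model c.3 with the default penalty order and compare (a sketch; the model names are mine):

m_c3_m1 <- mgcv::gam(y ~ group + s(x) + s(x, by = group, m = 1), data = dd, method = "REML")
m_c3_m2 <- mgcv::gam(y ~ group + s(x) + s(x, by = group), data = dd, method = "REML")
AIC(m_c3_m1, m_c3_m2)  # on data like these, the two fits should differ little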
Models c.2 and c.3 are very similar in how well they capture the variability in the data, but their AIC values suggest that c.2 is more parsimonious.
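That comparison in code (a sketch; the model names are mine):

m_c2 <- mgcv::gam(y ~ group + s(x, by = group), data = dd, method = "REML")
m_c3 <- mgcv::gam(y ~ group + s(x) + s(x, by = group, m = 1), data = dd, method = "REML")
AIC(m_c2, m_c3)  # the lower AIC marks the better fit/complexity trade-off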
Did you enjoy this? Consider joining my on-line course “First steps in data analysis with R” and learn data analysis from zero to hero!