Mixed-effect models (MEMs) are useful for dealing with unbalanced study designs and/or with non-independent data. In the context of MEMs, explanatory variables are divided into fixed effects and random effects. To give an operational definition: a model’s fixed effects are the explanatory variables whose effect we want to quantify explicitly; its random effects are variables that are not the main focus of the study, but that we include in the model because we expect them to account for some of the residual variability (more on fixed and random effects: here, here). We can think of random effects as sources of background noise that are important to estimate in order to obtain a more accurate estimate of the fixed effects. Mixed-effect models allow us to specify each explanatory variable as fixed or random, and to model it accordingly.
Some literature references for getting started:
- “A brief introduction to mixed effects modelling and multi-model inference in ecology” by Harrison et al. (2018).
- An excellent book for understanding MEMs and implementing them in R is A Beginner’s Guide to GLM and GLMM with R: A Frequentist and Bayesian Perspective for Ecologists by Alain F. Zuur, Joseph M. Hilbe, and Elena N. Ieno.
- An example by Ignasi Bartomeus (the caveat with his example is that the random effect has only three levels; while this is fine for teaching purposes, a random effect should generally have at least 5-6 levels to yield reliable variance estimates).
A worked example
The following example is imaginary, but based on real-world data. We are looking at how resource availability affects unicorns’ social patterns: specifically, at how unicorn herd size changes with food quantity. We surveyed unicorn herd sizes at seven localities over the course of twelve months. Food quantity was assessed monthly, so for each locality in each month we have one value of food availability (a continuous predictor, let’s say grass cover). For each locality in each month, unicorn herds were re-surveyed over four consecutive days to account for both unicorn movement and possible sampling error. For each sampling day there are as many data points as there were unicorn herds.
Data available here:
unicorns <- read.csv(text= RCurl::getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/unicorns.csv"))
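A quick check of the data structure never hurts (the column names listed in the comment are those used in the models later in this post):
str(unicorns)
head(unicorns)
# Columns used later in this post: Herd_size_n (herd size, a count),
# food.quantity (grass cover), Locality, Year_Month, survey_day_n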
It’s good practice to start with a visual exploration of the data, which suggests that herd size increases with food availability (Fig. 1).
Looking at the data separately for each locality, it appears that the relationship between food availability and herd size differs among localities (Fig. 2).
The relationship between food availability and herd size also differs among survey months, but eyeballing the graphs suggests that survey month is a smaller source of variability than locality (Fig. 3).
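Exploratory plots of this kind can be produced with ggplot2; here is a sketch using the column names that appear in the models below (the original figures may have been drawn differently):
library(ggplot2)

# Overall relationship between food availability and herd size (cf. Fig. 1):
ggplot(unicorns, aes(x = food.quantity, y = Herd_size_n)) +
  geom_point() +
  geom_smooth(method = "lm")

# The same relationship, separately for each locality (cf. Fig. 2):
ggplot(unicorns, aes(x = food.quantity, y = Herd_size_n)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Locality)

# ... and separately for each survey month (cf. Fig. 3):
ggplot(unicorns, aes(x = food.quantity, y = Herd_size_n)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Year_Month)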
The aim is to fit a model that:
- captures the possible relationship between herd size and food availability;
- accounts for the data being non-normally distributed;
- accounts for the possible sources of background variability, namely: inter-locality variability and inter-month variability;
- accounts for the non-independence of measurements performed over consecutive days in the same month.
The powerful function glmmTMB, from the package of the same name, allows us to do all of the above.
# factorize survey_day so that glmmTMB can use it to estimate temporal autocorrelation:
unicorns$survey_day_f <- glmmTMB::numFactor(unicorns$survey_day_n)
(See here, here, and here for details on how to model covariance structure/autocorrelation in glmmTMB.)
unicorns_glmmTMB0 <- glmmTMB::glmmTMB(Herd_size_n ~ food.quantity
+ (1 + food.quantity | Locality)
+ (1 + food.quantity | Year_Month)
+ ou(survey_day_f + 0 | Locality), # autocorrelation term
family="poisson", # adequate for count data
data=unicorns)
# Error in eigen(h) : infinite or missing values in 'x'
The model above is, in principle, the most appropriate given the information available: it allows Herd_size_n ~ food.quantity to differ in intercept and slope at the level of Locality and of Year_Month, and it accounts for temporal autocorrelation. Unfortunately, this model fails to converge, meaning that it is probably too complicated for all of its parameters to be estimated reliably from the amount of data available.
Dealing with convergence issues
To address convergence issues I follow this pipeline, based on information available here, here, and here:
- rescale the predictor(s);
- change optimiser;
- simplify the model.
Refitting the model after scaling food.quantity no longer throws an error, but still does not converge:
dd.CS <- transform(unicorns, food.quantity = scale(food.quantity)) # centre and scale the predictor
unicorns_glmmTMB.cs <- glmmTMB::glmmTMB(Herd_size_n ~ food.quantity
+ (1 + food.quantity | Locality)
+ (1 + food.quantity | Year_Month)
+ ou(survey_day_f + 0 | Locality),
family="poisson",
data= dd.CS)
# Warning messages:
# 1: In fitTMB(TMBStruc) :
# Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
# 2: In fitTMB(TMBStruc) :
# Model convergence problem; singular convergence (7). See vignette('troubleshooting')
Changing optimiser does not help either:
unicorns_glmmTMB.csb <- update(unicorns_glmmTMB.cs,
                               control = glmmTMB::glmmTMBControl(optimizer = optim,
                                                                 optArgs = list(method = "BFGS")))
# see also package optimx
# Warning message:
# In fitTMB(TMBStruc) :
# Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
A simpler model, which allows for random slopes for localities but not for survey months, still does not converge:
unicorns_glmmTMB <- glmmTMB::glmmTMB(Herd_size_n ~ food.quantity
                                     + (1 + food.quantity | Locality)
                                     + (1 | Year_Month)
                                     + ou(survey_day_f + 0 | Locality),
                                     family = "poisson",
                                     data = unicorns)
Rescaling, changing the optimiser to BFGS, or doing both did not help. The next step of model selection/simplification led me to a model allowing only for random intercepts for both Locality and survey month (Year_Month):
unicorns_glmmTMB_noslope <- glmmTMB::glmmTMB(Herd_size_n ~ food.quantity
+ (1 | Locality)
+ (1 | Year_Month)
+ ou(survey_day_f + 0 | Locality),
family="poisson",
data= unicorns)
unicorns_glmmTMB_noslope$sdr$pdHess ## a test for convergence: TRUE means the Hessian is positive-definite, i.e. the model converged
Although Fig. 2 suggests that including (1 + food.quantity | Locality) would better reflect the nature of my data than a model including only (1 | Locality), model unicorns_glmmTMB_noslope is the best that can be fitted to the data.
summary(unicorns_glmmTMB_noslope)
# Conditional model:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.125 0.787 -0.16 0.8735
# food.quantity 0.715 0.273 2.62 0.0089 **
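Because the Poisson family uses a log link, the coefficient for food.quantity can be read on the multiplicative scale; a rough back-of-the-envelope interpretation:
# With a log link, exponentiating the slope gives the multiplicative change
# in expected herd size per unit increase in food quantity (grass cover):
exp(0.715) # ~ 2.04
# i.e. expected herd size roughly doubles for each additional unit of grass cover.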
Model selection
Model selection between this model and simpler ones (e.g. not including Locality, Year_Month, or either) can be done by comparing their corrected AIC (hereafter: AICc). The correction in AICc is an adjustment for small sample sizes, which is not really necessary here but does not hurt (for large sample sizes, AICc converges to AIC); see MuMIn::AICc(). The best model is the simplest one with the lowest AICc. The minimum AIC difference required for a model to be considered meaningfully better at describing the data is debated; Harrison et al. (2018) suggest ∆AIC = 6. For competing models with ∆AIC below that threshold, I consider the simplest one to be the best.
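As a sketch, the comparison could look something like this (the two simplified models and their names are illustrative, not part of the original analysis; the ou() autocorrelation term, which is grouped by Locality, is kept in both):
# Hypothetical simpler models, dropping one random intercept at a time:
unicorns_glmmTMB_noMonth <- glmmTMB::glmmTMB(Herd_size_n ~ food.quantity
                                             + (1 | Locality)
                                             + ou(survey_day_f + 0 | Locality),
                                             family = "poisson",
                                             data = unicorns)
unicorns_glmmTMB_noLocality <- glmmTMB::glmmTMB(Herd_size_n ~ food.quantity
                                                + (1 | Year_Month)
                                                + ou(survey_day_f + 0 | Locality),
                                                family = "poisson",
                                                data = unicorns)
# Compare the corrected AIC of the competing models:
MuMIn::AICc(unicorns_glmmTMB_noslope,
            unicorns_glmmTMB_noMonth,
            unicorns_glmmTMB_noLocality)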
Header image: Miniature of a unicorn, in Philes, De natura animalium (France, 16th century). London, British Library. Public domain. Source
Did you enjoy this? Consider joining my on-line course “First steps in data analysis with R” and learn data analysis from zero to hero!