Count data are generally modelled by assuming that the response, conditional on the predictors, follows a Poisson distribution. Data expressed as ratios (successes out of a number of trials) are generally modelled by assuming a binomial distribution. In both cases, too many zeroes in the response variable can become a problem: the overabundance of zeroes, known as zero-inflation, can translate into overdispersion, making the distribution of choice (Poisson or binomial) a poor fit to the data.
One way to address zero-inflation is to use the zero-inflated versions of Poisson and binomial distributions. These zero-inflated distributions treat the data as a mixture of data generated by a Poisson or a binomial process, and data that belong to a zero-only distribution (in other words, a distribution with mean of 0 and variance of 0).
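To make the mixture idea concrete, here is a minimal base-R sketch that simulates zero-inflated counts. The sample size `n`, the Poisson mean `lambda` and the structural-zero probability `p_zero` are arbitrary values chosen for illustration, not values taken from any real dataset.

```r
set.seed(42)

n      <- 10000  # number of observations (arbitrary)
lambda <- 3      # mean of the Poisson process (arbitrary)
p_zero <- 0.4    # probability of a structural zero (arbitrary)

# Each observation is either a structural zero or a draw from the Poisson process
is_structural_zero <- rbinom(n, size = 1, prob = p_zero) == 1
y <- ifelse(is_structural_zero, 0L, rpois(n, lambda))

# Share of zeroes observed vs. share expected from a Poisson with the same mean
obs_zero_share <- mean(y == 0)
exp_zero_share <- dpois(0, lambda = mean(y))

# For Poisson data the variance equals the mean; a ratio well above 1
# signals overdispersion
dispersion_ratio <- var(y) / mean(y)

print(c(observed = obs_zero_share, expected = exp_zero_share))
print(dispersion_ratio)
```

With these settings the observed share of zeroes should sit well above what a plain Poisson with the same mean predicts, and the variance-to-mean ratio should exceed 1, which is the overdispersion described above.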
According to Paul Allison, in the case of zero-inflated count data, modelling the data according to a negative binomial distribution can be an alternative to a zero-inflated Poisson distribution, often providing a better fit and requiring less computational power. He concludes: “having a lot of zeros doesn’t necessarily mean that you need a zero-inflated model.” His considerations and his exchanges with William Greene and Dominique Lord (see this article and its comment section) are well worth reading.
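A quick way to see why a negative binomial can absorb many zeroes without any zero-inflation component is to compare the probability of a zero under a Poisson and under a negative binomial with the same mean. This is only an illustrative sketch: the mean `mu` and the size parameter `k` are arbitrary choices.

```r
mu <- 2    # common mean of both distributions (arbitrary)
k  <- 0.3  # negative binomial size parameter; small k = strong overdispersion (arbitrary)

# Probability of observing a zero under each distribution
p0_pois   <- dpois(0, lambda = mu)
p0_nbinom <- dnbinom(0, size = k, mu = mu)

# Closed form for the negative binomial, for comparison: (k / (k + mu))^k
p0_nbinom_closed <- (k / (k + mu))^k

print(c(poisson = p0_pois, nbinom = p0_nbinom, nbinom_closed = p0_nbinom_closed))
```

With the same mean, the strongly overdispersed negative binomial puts far more probability on zero than the Poisson does.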
As a reminder, here are some definitions:
- the Poisson distribution is “a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event” (reference);
- the Binomial distribution gives the probability of observing exactly X events (successes) in n independent trials (reference);
- the Negative Binomial distribution gives the probability that exactly Y trials are required until an event is observed for the rth time (reference).
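The three definitions above map onto base R's density functions. The sketch below uses arbitrary numbers; note that `dnbinom()` is parameterised by the number of failures before the rth success rather than by the total number of trials, so Y trials correspond to Y - r failures.

```r
# Poisson: probability of exactly 2 events when the mean rate is 3
p_pois <- dpois(2, lambda = 3)

# Binomial: probability of X = 4 events in n = 10 trials,
# each with event probability 0.3
p_binom <- dbinom(4, size = 10, prob = 0.3)

# Negative binomial: probability that Y = 7 trials are needed
# to observe the event for the r = 3rd time.
# dnbinom() counts failures before the r-th success, hence Y - r.
r <- 3
Y <- 7
p <- 0.3
p_nbinom <- dnbinom(Y - r, size = r, prob = p)

print(c(poisson = p_pois, binomial = p_binom, neg_binomial = p_nbinom))
```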
Modelling zero-inflated datasets with glmmTMB
The glmmTMB package makes it possible to:
- “specify a zero-inflation model (via the ziformula argument) with fixed and/or random effects” (Ben Bolker)
- specify a negative binomial model using family=nbinom1, which assumes Var∝𝜇, or family=nbinom2, which assumes Var=𝜇*(1 + 𝜇/𝑘). For both nbinom1 and nbinom2, a dedicated argument makes it possible to specify whether the dispersion parameter is affected by any of the explanatory variables.
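The difference between the two variance assumptions is easier to see with numbers. The sketch below is a base-R illustration, not glmmTMB code: `var_nbinom1()` and `var_nbinom2()` are hypothetical helpers, and the linear form 𝜇*(1 + 𝜑) is assumed here as one way of writing Var∝𝜇. The values of `phi` and `k` are arbitrary.

```r
set.seed(1)

# Hypothetical helpers: variance implied by each parameterisation
var_nbinom1 <- function(mu, phi) mu * (1 + phi)     # linear in mu (assumed form)
var_nbinom2 <- function(mu, k)   mu * (1 + mu / k)  # quadratic in mu

# Compare the implied variances over a few means
mu <- c(1, 5, 20)
print(data.frame(mu      = mu,
                 nbinom1 = var_nbinom1(mu, phi = 2),
                 nbinom2 = var_nbinom2(mu, k = 2)))

# Empirical check of the nbinom2 relationship with base R's rnbinom(),
# whose (mu, size) parameterisation has variance mu + mu^2 / size
y <- rnbinom(1e5, size = 2, mu = 5)
empirical_var   <- var(y)
theoretical_var <- var_nbinom2(5, k = 2)  # 5 * (1 + 5 / 2) = 17.5

print(c(empirical = empirical_var, theoretical = theoretical_var))
```

Under the linear form the variance grows in step with the mean, while under the quadratic form it grows much faster for large means, which is why the two families can fit the same data quite differently.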
A negative binomial distribution and the ziformula argument can be used together: “ziformula specifies zero-inflation [while family=nbinom1] take care of other sources of overdispersion” (Ben Bolker).
ziformula is “a one-sided (i.e., no response variable) formula for zero-inflation combining fixed and random effects: the default ~0 specifies no zero-inflation. Specifying ~. sets the zero-inflation formula identical to the right-hand side of formula (i.e., the conditional effects formula); terms can also be added or subtracted. When using ~. as the zero-inflation formula in models where the conditional effects formula contains an offset term, the offset term will automatically be dropped. The zero-inflation model uses a logit link” (from ?glmmTMB). It “describes how the probability of an extra zero (i.e. structural zero) will vary with predictors” (see this excellent vignette by Mollie Brooks). By default, glmmTMB excludes zero-inflation; specifying zi=~0 does so explicitly. Specifying zi=~1 applies a single zero-inflation parameter to all observations.
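The options described above can be sketched as follows, using the Salamanders dataset that ships with glmmTMB. The choice of predictors (`mined`, with a random intercept for `site`) is an assumption made for illustration only, not a recommended model. The code spells out `ziformula`; the shorter `zi=` used in the text is resolved to the same argument by R's partial argument matching. The whole block is wrapped in a check so that it is skipped when glmmTMB is not installed.

```r
if (requireNamespace("glmmTMB", quietly = TRUE)) {
  library(glmmTMB)
  data("Salamanders", package = "glmmTMB")

  # No zero-inflation: the default, written out explicitly
  m_pois <- glmmTMB(count ~ mined + (1 | site),
                    data = Salamanders, family = poisson,
                    ziformula = ~0)

  # A single zero-inflation probability shared by all observations
  m_zip <- glmmTMB(count ~ mined + (1 | site),
                   data = Salamanders, family = poisson,
                   ziformula = ~1)

  # Zero-inflation probability that varies with a predictor
  m_zip_mined <- glmmTMB(count ~ mined + (1 | site),
                         data = Salamanders, family = poisson,
                         ziformula = ~mined)

  # Zero-inflation combined with a negative binomial,
  # which handles other sources of overdispersion
  m_zinb <- glmmTMB(count ~ mined + (1 | site),
                    data = Salamanders, family = nbinom2,
                    ziformula = ~mined)

  # Compare the candidate models (lower AIC = better trade-off)
  print(AIC(m_pois, m_zip, m_zip_mined, m_zinb))
}
```

Comparing the fits with AIC is one simple way to check whether the zero-inflation term, the negative binomial family, or both are actually needed for a given dataset.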