Accounting for phylogenetic relatedness in statistical models

A special case of non-independent data arises from the phylogenetic relationships among species. Suppose we want to study the correlation between dietary habits and mean body weight in mammals. The body weight of a species may be affected by its dietary habits, but phylogeny may also matter: closely related species may have similar mean body weights simply because of their shared evolutionary history. Any such analysis therefore has to account for phylogenetic signal. One quick and dirty solution is to use a mixed-effects model with taxonomic levels specified as nested random effects:

MEM.model <- lme4::lmer(body.weight ~ dietary.pref 
            + (1 + dietary.pref | family/genus/species), 
            data = mammals) 
# "mammals" is an imaginary dataset. For each data point it would have to
# contain body.weight and dietary.pref, as well as the family, genus and
# species to which that data point belongs.
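Since "mammals" does not exist, here is a minimal sketch of what such a dataset could look like and how a nested model might be fitted to it. Everything in it is an assumption made for illustration: the taxon names, the sample sizes and the effect sizes are simulated, and the model uses random intercepts only (a simplification of the formula above) so that it stays estimable on such a small dataset.

```r
library(lme4)

set.seed(42)

# Hypothetical taxonomy: 4 families x 3 genera x 3 species,
# with 5 individuals measured per species (180 rows in total).
taxa <- expand.grid(individual = 1:5, sp = 1:3, gen = 1:3, fam = 1:4)
mammals <- data.frame(
  family  = factor(paste0("fam", taxa$fam)),
  genus   = factor(paste0("fam", taxa$fam, "_gen", taxa$gen)),
  species = factor(paste0("fam", taxa$fam, "_gen", taxa$gen, "_sp", taxa$sp))
)

# One (made-up) dietary preference per species, 12 species per diet.
diets <- sample(rep(c("carnivore", "herbivore", "omnivore"),
                    length.out = nlevels(mammals$species)))
mammals$dietary.pref <- factor(diets[as.integer(mammals$species)])

# Simulated body weight on an arbitrary scale: a diet effect plus
# family, genus and species deviations and individual-level noise.
fam.eff  <- rnorm(nlevels(mammals$family),  sd = 1.0)
gen.eff  <- rnorm(nlevels(mammals$genus),   sd = 0.5)
sp.eff   <- rnorm(nlevels(mammals$species), sd = 0.3)
diet.eff <- c(carnivore = 0, herbivore = 0.8, omnivore = 0.3)

mammals$body.weight <- 3 +
  unname(diet.eff[as.character(mammals$dietary.pref)]) +
  fam.eff[as.integer(mammals$family)] +
  gen.eff[as.integer(mammals$genus)] +
  sp.eff[as.integer(mammals$species)] +
  rnorm(nrow(mammals), sd = 0.2)

# Nested random intercepts: species within genus within family.
MEM.sketch <- lmer(body.weight ~ dietary.pref + (1 | family/genus/species),
                   data = mammals)
summary(MEM.sketch)
```

With real data, lmer may report a singular fit when a taxonomic level explains almost no variance; that is a hint to simplify the random-effects structure rather than a failure of the approach.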

A more targeted alternative is to integrate phylogenetic information directly into the analysis. This is something I have only dabbled in, so consider the following notes a starting point (current as of 2021). Phylogenetic generalized least squares (PGLS) uses an existing phylogenetic tree to inform the model about the expected autocorrelation between taxonomic units. Here is an example that looks at the relationship between wing length and tarsus length among Geospiza finch species. This approach allows only one data point per leaf of the phylogenetic tree. The package phyr can fit generalized mixed-effects models that account for phylogenetic relatedness and can handle data sets in which each taxonomic unit is represented by multiple data points (see here and here).
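To give a feel for what a PGLS fit looks like in practice, here is a minimal sketch using the ape and nlme packages. It does not use the real Geospiza data: the tree is random and both traits are simulated under Brownian motion, and the names finches, wing.length and tarsus.length are hypothetical. It also assumes a reasonably recent version of ape, in which corBrownian accepts a form argument to match rows to tips by species name.

```r
library(ape)
library(nlme)

set.seed(1)

# A random 15-tip ultrametric tree standing in for a real phylogeny.
tree <- rcoal(15)

# Two simulated traits that evolve by Brownian motion along the tree;
# wing length depends on tarsus length plus phylogenetically structured noise.
tarsus <- rTraitCont(tree, model = "BM", sigma = 1)
wing   <- 2 + 0.8 * tarsus + rTraitCont(tree, model = "BM", sigma = 0.5)

# One row per tip of the tree, as PGLS requires.
finches <- data.frame(
  species       = tree$tip.label,
  tarsus.length = tarsus[tree$tip.label],
  wing.length   = wing[tree$tip.label]
)

# GLS with a Brownian-motion correlation structure derived from the tree.
pgls.fit <- gls(wing.length ~ tarsus.length,
                data = finches,
                correlation = corBrownian(value = 1, phy = tree,
                                          form = ~species))
summary(pgls.fit)
```

With real data, the species names in the data frame have to match the tip labels of the tree exactly, and tips without trait data have to be pruned before fitting.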
