ANOVA, model selection, and pairwise contrasts among treatments using R

Some time ago I wrote about how to fit a linear model and interpret its summary table in R. At the time I used an example in which the response variable depended on two explanatory variables and on their interaction. It was a rather specific article, in which I overlooked some essential steps in the process of selecting and interpreting a statistical model. In the present article and in the accompanying worked examples I will expand on the topic, explaining 1) how to select the most parsimonious model for a dataset using the function anova() and 2) how to use summary(), together with relevel(), to test for significant differences between pairs of experimental treatments (R script here).

As a reminder, linear models can always be written in the form:

Yi = μ + Ai + Bi + Ci + … + Ai*Bi + Ai*Ci + Bi*Ci + Ai*Bi*Ci + … + ɛi


  • μ is the overall mean of all the Yi
  • Ai, Bi, and Ci are the differences between Yi and the overall μ due to the main effects of A, B, and C
  • Ai*Bi, Ai*Ci, etc. are the differences between Yi and the overall μ due to interactions among the explanatory variables
  • ɛi is the difference between Yi and μ left unexplained by the model.

A linear model can be thought of as the mathematical transcription of our research question. What is the effect of factors A, B, and C on our response variable? Does the effect of each factor on Y depend on the other factors? We answer these questions by testing whether models including all factors and all their interactions explain significantly more variability in Y than simpler models (including fewer factors and/or fewer interactions) do.

In general, the golden steps are:

  1. define a clear, testable question;
  2. design the study/experiment accordingly;
  3. collect your data;
  4. PLOT THE DATA and start doing qualitative interpretation of the study results;
  5. test your initial question(s) by defining the corresponding model in a statistical software (e.g. R);
  6. assess whether the model assumptions are respected;
  7. select the most parsimonious model, e.g. using anova() or AIC();
  8. examine the model outcome using summary().

Example 1. One-way ANOVA

The question: does chick weight differ if the chicks are fed differently?

# Load data:
data(chickwts) # chickwts is one of the many data sets built 
# in the R software by default. 
# data() is a function specific for calling these datasets.

boxplot(weight ~ feed,
    data = chickwts,
    ylab="Weight (grams)")

It seems that there are indeed differences among treatments. Let’s check this statistically:

chick1 <- lm(weight ~ feed, data = chickwts)
par(mfrow=c(2,2)) # prepare a window with room for 4 plots
plot(chick1) # test the assumptions graphically.
These diagnostic plots show that the assumptions for fitting a reliable linear model are reasonably well met. Check out an R manual such as Crawley’s “The R Book” or Beckerman and Petchey’s “Getting Started with R” for details on how to interpret them.

To know whether the feeding treatment is relevant for predicting chick weight, we must compare this model with the simpler one:

chick0 <- lm(weight ~ 1, data = chickwts) 
# in this case, this is also the simplest model possible.

Which model is more parsimonious?

anova(chick1, chick0)
Analysis of Variance Table

Model 1: weight ~ feed
Model 2: weight ~ 1
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     65 195556                                  
2     70 426685 -5   -231129 15.365 5.936e-10 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This test checks whether the more complex model explains significantly more variability than the simpler one. In this case the answer is yes: adding information about the food provided to the chicks allows more precise estimates of their body weight.
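The F statistic in that table can be reproduced by hand from the residual sums of squares of the two models — a quick sanity check on the same chickwts data:

```r
data(chickwts)
chick1 <- lm(weight ~ feed, data = chickwts)   # full model
chick0 <- lm(weight ~ 1, data = chickwts)      # null model

rss1 <- sum(resid(chick1)^2)                   # ~195556, residual SS of the full model
rss0 <- sum(resid(chick0)^2)                   # ~426685, residual SS of the null model
df_extra <- chick0$df.residual - chick1$df.residual  # 5 extra parameters in chick1

# F = (drop in RSS per extra parameter) / (residual variance of the full model)
F_stat <- ((rss0 - rss1) / df_extra) / (rss1 / chick1$df.residual)
p_val  <- pf(F_stat, df_extra, chick1$df.residual, lower.tail = FALSE)
round(F_stat, 3)   # 15.365, matching the anova() table
```

The drop in RSS is exactly the “Sum of Sq” column of the table (shown there with a negative sign because the simpler model is listed second).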

Now, how can we test whether specific pairs of treatments differ significantly? R shows this in the model’s summary table:

summary(chick1)

Call:
lm(formula = weight ~ feed, data = chickwts)

Residuals:
     Min       1Q   Median       3Q      Max 
-123.909  -34.413    1.571   38.170  103.091 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    323.583     15.834  20.436  < 2e-16 ***
feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
feedsoybean    -77.155     21.578  -3.576 0.000665 ***
feedsunflower    5.333     22.393   0.238 0.812495    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.85 on 65 degrees of freedom
Multiple R-squared:  0.5417,    Adjusted R-squared:  0.5064 
F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

The treatment levels are ordered alphabetically, so (Intercept) refers to the mean weight of chicks fed on casein: 323.583 grams (SE = 15.834). This estimate is significantly different from zero (see its p-value). Estimates for all other levels are given as differences from the reference level. For example, the mean weight of chicks fed on horsebean is (323.583 − 163.383) grams. The p-value shows that the two means differ significantly.
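To convince yourself that the coefficients really are differences from the casein mean, compare them with the group means computed directly:

```r
data(chickwts)
group_means <- with(chickwts, tapply(weight, feed, mean))
round(group_means, 3)
# casein    = 323.583  (the intercept)
# horsebean = 160.200  (= 323.583 - 163.383), and so on for the other levels
```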

Now, what if we were interested in the difference in weight between chicks fed on meatmeal and on soybean? The mean weights are (323.583 − 46.674) grams and (323.583 − 77.155) grams respectively, but are they significantly different from each other? The p-values provided refer to each level’s difference from the reference level, so they cannot help us. The trick is to use the function relevel():

chickwts$feed <- relevel(chickwts$feed, ref = "meatmeal")
# this does NOT change any data, it just tells R 
# to consider the specified treatment as reference level in the summary.
# IMPORTANT! Re-run the linear model:
chick1 <- lm(weight ~ feed, data = chickwts)

summary(chick1)

Call:
lm(formula = weight ~ feed, data = chickwts)

Residuals:
     Min       1Q   Median       3Q      Max 
-123.909  -34.413    1.571   38.170  103.091 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     276.91      16.54  16.744  < 2e-16 ***
feedcasein       46.67      22.90   2.039   0.0456 *  
feedhorsebean  -116.71      23.97  -4.870 7.48e-06 ***
feedlinseed     -58.16      22.90  -2.540   0.0135 *  
feedsoybean     -30.48      22.10  -1.379   0.1726    
feedsunflower    52.01      22.90   2.271   0.0264 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.85 on 65 degrees of freedom
Multiple R-squared:  0.5417,    Adjusted R-squared:  0.5064 
F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

The two levels do not differ significantly, so the feeding regimes “meatmeal” and “soybean” produce chicks with comparable weights. This releveling process can be repeated until all (relevant) pairwise comparisons have been performed.

Example 2. Two-way ANOVA

The same ideas discussed for a linear model with only one explanatory variable can be extended to linear models with two or more explanatory variables.

Let’s say we want to know whether the effect of food on chick weight differs among chicks of different breeds. We would have to run a new experiment, and then analyze the data as follows. First, fit the three models we would like to compare:

chick_interaction <- lm(weight ~ feed * breed, data = FoodBreedExpt)
chick_additive <- lm(weight ~ feed + breed, data = FoodBreedExpt)
chick_feedonly <- lm(weight ~ feed, data = FoodBreedExpt)

[Note: FoodBreedExpt is an imaginary dataset. These data are not provided.]

Then, compare the models using anova():

anova(chick_interaction, chick_additive)

If model chick_interaction is significantly better than chick_additive, it means that chicks from different breeds respond to different feeding regimes differently. If chick_additive is more parsimonious, then we should check it against chick_feedonly:

anova(chick_additive, chick_feedonly)

If chick_additive still “wins”, then chicks from different breeds show different body weights overall, but the differences in chick weight among feeding treatments are similar regardless of breed.

Once the most parsimonious model is selected, we can proceed with summary() and relevel() to look at differences between pairs of treatments in detail. The procedure is the same as shown for Example 1 in the R script. See also this article.
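Since FoodBreedExpt is imaginary, here is a runnable sketch of the whole sequence using simulated data; the factor levels, sample sizes, and effect sizes below are invented purely for illustration:

```r
set.seed(42)
# Simulate a balanced two-way design: 3 feeds x 2 breeds x 10 chicks each
FoodBreedExpt <- expand.grid(feed  = c("casein", "soybean", "sunflower"),
                             breed = c("A", "B"),
                             chick = 1:10)
# Build weights with additive feed and breed effects only (no interaction)
FoodBreedExpt$weight <- 300 +
  c(casein = 0, soybean = -60, sunflower = 10)[as.character(FoodBreedExpt$feed)] +
  c(A = 0, B = 25)[as.character(FoodBreedExpt$breed)] +
  rnorm(nrow(FoodBreedExpt), sd = 30)

chick_interaction <- lm(weight ~ feed * breed, data = FoodBreedExpt)
chick_additive    <- lm(weight ~ feed + breed, data = FoodBreedExpt)
chick_feedonly    <- lm(weight ~ feed,         data = FoodBreedExpt)

anova(chick_interaction, chick_additive)  # interaction should not be supported
anova(chick_additive, chick_feedonly)     # breed effect should be retained
```

Because the simulated effects are purely additive, the first comparison should favour chick_additive, while the second should retain the breed term — mirroring the reasoning above.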

I hope these notes helped. There are other ways to achieve the results shown above. For example, some people prefer to use aov() to perform the analysis of variance and TukeyHSD() for pairwise contrasts. The procedure I described is fairly general (it works for regression, ANOVA, ANCOVA, and mixed effect models in the same way) and it works pretty well for me.
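For comparison, the aov()/TukeyHSD() route on the chickwts data looks like this; unlike repeated releveling, TukeyHSD() also adjusts the p-values for multiple comparisons:

```r
data(chickwts)
chick_aov <- aov(weight ~ feed, data = chickwts)
summary(chick_aov)    # same F test as anova(chick1, chick0) above
TukeyHSD(chick_aov)   # all 15 pairwise contrasts in one Tukey-adjusted table
```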

An alternative tool for model selection is the Akaike Information Criterion (AIC). Here are some considerations on AIC by Brian McGill. The AIC estimates the goodness of a model by measuring how well it fits the data while penalising it according to its number of parameters. In other words, it favours the simplest model that describes the data best. The lower the AIC, the better the model. For small sample sizes the corrected AIC (AICc) should be used; see AICc() in package MuMIn. For large sample sizes, AICc ≈ AIC. The minimum AIC difference for a model to be considered significantly better at describing the data is debated; Harrison et al. (2018) suggest ∆AIC = 6. For competing models with ∆AIC < 6, I consider the simplest to be the best.
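On the chickwts example from above, AIC-based selection is a one-liner (the AICc call via the MuMIn package is commented out, in case that package is not installed):

```r
data(chickwts)
chick1 <- lm(weight ~ feed, data = chickwts)
chick0 <- lm(weight ~ 1, data = chickwts)
AIC(chick1, chick0)   # lower AIC = better; here chick1 wins by far (delta AIC >> 6)
# install.packages("MuMIn"); MuMIn::AICc(chick1, chick0)  # for small samples
```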

Note that I am not a statistician, so I invite you to check the information above in a textbook if something I wrote sounds weird (or drop me an email). There are plenty of good stats books out there. Among introductory books for R, I particularly like Beckerman and Petchey’s “Getting Started with R”.