Comparing two groups: t-test

Let’s compare the petal length of two Iris species, I. setosa and I. versicolor.

data(iris)
 dd <- subset(iris, select=c(Species, Petal.Length), Species!="virginica")
 dd$Species <- as.character(dd$Species)

Always start from plotting your data! Get a visual understanding of the data and patterns (or lack thereof) before running analyses.

 boxplot(Petal.Length ~ Species, data=dd)
 # boxplot() creates a box-and-whiskers plot.

I could have just used function plot(). When plotting data versus a categorical variable, R uses a box-and-whiskers plot by default.

About box-and-whiskers plots in R:

  • The thick line is not the mean but the median
  • the two bases of the box (calles “hinges” in the help file) represent the first and third quartile, so the box contains ~50% of the values
  • the ends of the “whiskers” delimit the 95% confidence interval.
 # there's a function for t tests: t.test()
 t.test(Petal.Length ~ Species, data=dd) # Welch's t-test
 t.test(Petal.Length ~ Species, data=dd, var.equal=TRUE) # Student's t-test
 # spot the differences! 

IMPORTANT! Whether it’s a t-test, ANOVA, ANCOVA, linear regression, multiple linear regression… Statistically speaking we are always testing what’s called a “linear model” – very roughly said, it is a model that estimates linear changes in the response variable in relationship with changes in the explanatory variables.

All general linear models can be performed in R using function lm(), and then tested/examined using functions anova() and summary().

 test1 <- lm(Petal.Length ~ Species, data=dd)
 test0 <- lm(Petal.Length ~ 1, data=dd) # null model
 anova(test1, test0) # anova() compares models testing whether their residual amounts of variability differ significantly
 summary(test1) 
# Code for the figure above:
 par(mfrow=c(2,1))
 
 hist(subset(dd, Species!="versicolor")$Petal.Length, freq=T, xlim=c(0,6), col=grey(0.3), breaks=c((1:20)/2), ylim=c(0, 40), main="", xlab="Petal length (cm)")
 hist(subset(dd, Species=="versicolor")$Petal.Length, freq=T, xlim=c(0,6), col=grey(0.75), breaks=c((1:20)/2), add=T)

 legend("topright", bty="n",
 legend=c("Iris setosa", "I. virginica"),
 pch=22, pt.cex=3,
 pt.bg=c(grey(0.3), grey(0.75))
 ) 
 
 plot(density(subset(dd, Species!="versicolor")$Petal.Length, bw = "sj", adjust=2), xlim=c(0,6), main="Density distributions", xlab="Petal length (cm)")
 lines(density(subset(dd, Species=="versicolor")$Petal.Length, bw = "sj", adjust=2), col="red")
 text(x=c(2.2, 4.5), y=c(1.5, 0.7), 
 labels=c("Iris setosa", "I. virginica"),
 col=c("black", "red")
 )

 # adjust=... is an extra argument for density() lty="dashed" ) 
 # note how I used the ash symbol to make a piece of code "silent". 
 # I find this useful when I am fiddling with code, to silence sections of it without deleting them.