The following is a graphical explanation of the advantage of accounting for non-independence of observations when testing for differences between groups.
rm(list = ls())
set.seed(66)
# simulate body fat percentages under the two diets
x1 <- rnorm(n = 100, mean = 30, sd = 10)
x2 <- rnorm(n = 100, mean = 60, sd = 10)
# arrange the data in a dataset
dd <- data.frame(ID = rep(paste("ID", seq(1, 100, by = 1), sep = "_"), 2),
                 response = c(x1, x2),
                 diet = c(rep("A", 100), rep("B", 100))
                 )
# response2: both diets sorted in the same direction, so each subject's value
# under Diet B exceeds its value under Diet A by a similar amount
dd$response2 <- c(sort(x1, decreasing = FALSE), sort(x2, decreasing = FALSE))
Diets A and B refer to the same 100 subjects. For example, they may represent the body fat percentage of 100 pigs when fed diet A and diet B. The body fat distribution for Diet A is the same in response and response2; so is the body fat distribution for Diet B.
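This can be checked directly. The snippet below (a quick sketch, not part of the original post) rebuilds the simulated data and confirms that, within each diet, response and response2 contain exactly the same values, only assigned to different subjects:

```r
set.seed(66)
x1 <- rnorm(n = 100, mean = 30, sd = 10)   # Diet A
x2 <- rnorm(n = 100, mean = 60, sd = 10)   # Diet B
dd <- data.frame(response  = c(x1, x2),
                 response2 = c(sort(x1), sort(x2)),
                 diet      = rep(c("A", "B"), each = 100))
# sorting both columns within a diet gives identical vectors,
# so the per-diet distributions coincide
all.equal(sort(dd$response[dd$diet == "A"]), sort(dd$response2[dd$diet == "A"]))  # TRUE
all.equal(sort(dd$response[dd$diet == "B"]), sort(dd$response2[dd$diet == "B"]))  # TRUE
```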
In response, data points from groups A and B are not associated according to any pattern (Fig. 1: a, c): body fat percentage is higher overall for Diet B, but each subject responds differently to the change in diet. In response2, body fat percentage is higher for Diet B than it is for Diet A in every subject (Fig. 1: b, d). The scenario described by response2 represents a more uniform response to the change in diet: in other words, the response to the change in diet has less intra-individual variability.
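This difference in intra-individual variability can be made concrete by looking at the standard deviation of the within-subject differences (a quick sketch, not part of the original post; it assumes the uniform-response pairing sorts both diets in the same direction):

```r
set.seed(66)
x1 <- rnorm(n = 100, mean = 30, sd = 10)   # Diet A
x2 <- rnorm(n = 100, mean = 60, sd = 10)   # Diet B
# response: subjects are paired with no particular pattern
sd(x2 - x1)
# response2: i-th smallest value under A paired with i-th smallest under B,
# so every subject shows roughly the same increase
sd(sort(x2) - sort(x1))   # much smaller
```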
If we test for differences between the two diets, in response and in response2, without accounting for subject identity, the two t-tests give exactly the same results:
t.test(dd$response[which(dd$diet == "A")],
       dd$response[which(dd$diet == "B")],
       paired = FALSE, var.equal = TRUE
       )
# Two Sample t-test
# data: dd$response[which(dd$diet == "A")] and dd$response[which(dd$diet == "B")]
# t = -22.9, df = 198, p-value <2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -32.549 -27.388
# sample estimates:
# mean of x mean of y
# 30.403 60.371
t.test(dd$response2[which(dd$diet == "A")],
       dd$response2[which(dd$diet == "B")],
       paired = FALSE, var.equal = TRUE
       )
# Two Sample t-test
# data: dd$response2[which(dd$diet == "A")] and dd$response2[which(dd$diet == "B")]
# t = -22.9, df = 198, p-value <2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -32.549 -27.388
# sample estimates:
# mean of x mean of y
# 30.403 60.371
By running paired t-tests we account for the non-independence of the observations under Diet A and Diet B. This allows us to reduce the amount of unexplained variability, thus increasing the signal-to-noise ratio:
t.test(dd$response[which(dd$diet == "A")],
       dd$response[which(dd$diet == "B")],
       paired = TRUE   # var.equal is ignored in paired tests
       )
# Paired t-test
# data: dd$response[which(dd$diet == "A")] and dd$response[which(dd$diet == "B")]
# t = -22.5, df = 99, p-value <2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -32.607 -27.331
# sample estimates:
# mean of the differences
# -29.969
t.test(dd$response2[which(dd$diet == "A")],
       dd$response2[which(dd$diet == "B")],
       paired = TRUE
       )
# Paired t-test
# data: dd$response2[which(dd$diet == "A")] and dd$response2[which(dd$diet == "B")]
# t = -177, df = 99, p-value <2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -30.305 -29.632
# sample estimates:
# mean of the differences
# -29.969
The t-value measures the size of the difference between the two groups relative to the variability in the data. The closer t is to zero, the weaker the evidence for a real difference between the groups. Note how the absolute value of the t-statistic is much larger in the latter case (|t| = 177 for response2 versus |t| = 22.5 for response).
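The paired t-statistic is simply the mean within-subject difference divided by its standard error. Computing it by hand (a sketch, not part of the original post; paired_t is a helper defined here, not a base R function) makes explicit how lower intra-individual variability inflates |t|:

```r
set.seed(66)
x1 <- rnorm(n = 100, mean = 30, sd = 10)   # Diet A
x2 <- rnorm(n = 100, mean = 60, sd = 10)   # Diet B

# t = mean(d) / (sd(d) / sqrt(n)), where d are the within-subject differences
paired_t <- function(a, b) {
  d <- a - b
  mean(d) / (sd(d) / sqrt(length(d)))
}

paired_t(x1, x2)             # random pairing: reproduces t.test(..., paired = TRUE)
paired_t(sort(x1), sort(x2)) # uniform response: same mean difference, smaller sd(d)
```

The mean of the differences is the same in both cases; only the denominator, driven by sd(d), changes.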
Did you enjoy this? Consider joining my online course “First steps in data analysis with R” and learn data analysis from zero to hero!