This is my attempt at explaining to the uninitiated what multivariate analysis is and what it does.
Put very simply, when we use a frequentist approach to evaluate whether, and how, groups differ with regard to a certain variable, we:
- 1) measure that variable in a bunch of individuals,
- 2) calculate the relative frequency of each value taken by the variable of interest,
- 3) look at the frequency distribution of that variable in each group, and
- 4) make a call about whether it has the same frequency distribution in each group.
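The four steps above can be sketched in a few lines of base R. This is only an illustration using the built-in iris data: the bin width of 0.5 and the object names (`breaks`, `bins`, `rel_freq`) are my own arbitrary choices, not part of any standard recipe.

```r
data(iris)

# Step 1: the variable has already been measured (sepal length, 50 flowers per species).
# Step 2: bin the values using the same breaks for every group
# (sepal length ranges from 4.3 to 7.9, so 4 to 8 covers all observations).
breaks <- seq(4, 8, by = 0.5)
bins <- cut(iris$Sepal.Length, breaks = breaks)

# Step 3: relative frequency of each bin within each species (each row sums to 1).
rel_freq <- prop.table(table(iris$Species, bins), margin = 1)
round(rel_freq, 2)

# Step 4: eyeball whether the three rows look alike; a formal test would make this call for us.
```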
Visually, we can think of statistical tests such as the t-test and ANOVA as tests that look at how much, and how well, the frequency distributions of our study groups overlap. For example, this is the approach if we want to compare three Iris species by their sepal length (e.g. top panel of the attached figure).
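As a rough sketch of what this looks like in practice, here is a one-way ANOVA on sepal length across the three species, plus a t-test on two of them. Strictly speaking both tests compare group means under their usual assumptions rather than whole distributions; the choice of which two species to feed to the t-test is mine and purely illustrative.

```r
data(iris)

# One-way ANOVA: does mean sepal length differ among the three species?
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)

# Pull the p-value for the Species term out of the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# A t-test handles two groups only, so keep two species and drop the unused level.
two_species <- droplevels(subset(iris, Species != "virginica"))
tt <- t.test(Sepal.Length ~ Species, data = two_species)
tt
```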
Multivariate statistics come into play when we want to compare groups based on more than one variable.
If we want to compare groups according to two variables, now we are comparing the distribution of the measurements belonging to each of our study groups in a 2-dimensional space, but we still want to check if such distributions are similar based on their location, spread, and shape. For example, this is the approach if we want to compare three Iris species by their sepal length and petal length at the same time (e.g. bottom right panel of the figure). The question we are trying to address is: how well do the data patches of the three species overlap?
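One possible way to put a number on that question is a MANOVA, the multivariate cousin of ANOVA. I am using it here as an assumption about which test fits, since it mainly asks whether the group centroids (locations) differ and is not the only option. The Pillai statistic is simply the default in R.

```r
data(iris)

# Two response variables at once, bound together with cbind().
fit2 <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
res2 <- summary(fit2, test = "Pillai")
res2

# p-value for the Species term.
p_manova <- res2$stats["Species", "Pr(>F)"]
```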
If our comparison is based on three variables, now the observations of each study group are going to be spread in a 3-dimensional space. In this case we are not dealing with 2-dimensional data “patches” anymore, but with 3-dimensional data “clouds”. Despite the addition of a third dimension, we can compare such data clouds based on their location, spread, and shape, just like we did with 2-dimensional data “patches”.
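To make "location" and "spread" a bit more concrete, here is a minimal sketch that describes each species' 3-dimensional cloud by its centroid and its covariance matrix. The third variable (petal width) is my own pick for the example, and `vars`, `centroids` and `cov_by_species` are just illustrative names.

```r
data(iris)

# Three variables define the 3-dimensional space (petal width is an arbitrary third choice).
vars <- c("Sepal.Length", "Petal.Length", "Petal.Width")

# Location: the centroid (vector of means) of each species' cloud.
centroids <- aggregate(iris[, vars], by = list(Species = iris$Species), FUN = mean)
centroids

# Spread and orientation: one 3 x 3 covariance matrix per species.
cov_by_species <- lapply(split(iris[, vars], iris$Species), cov)
lapply(cov_by_species, round, 3)
```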
To compare groups based on more than three variables, the same principles apply. Now we will be comparing “multidimensional hyper-clouds”, so visualising them graphically becomes tricky, but the statistical approach is the same.

To visualise multidimensional data we need to reduce their dimensionality, so that we can squish them into a 2-D or 3-D space. That’s where ordination methods come into play. We can think of ordination methods as different ways of taking photographs of reality. Reality is in 3D, but we can take a picture of it to represent it in a 2-dimensional space. This comes at the cost of some information (just as we cannot see behind a table of which we only have a picture) but it has its perks (we can print a picture and carry it with us). Ordinations allow us to display the data in 2D or 3D in a way that reflects the true distances among data points as faithfully as possible, much like one can look for the best angle from which to photograph a room, to convey its three-dimensionality and show as many details as possible.
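As an illustration, principal component analysis (PCA) is one common ordination method. The choice of PCA, and of scaling the variables first, are my assumptions for this sketch rather than the only way to take the "photograph". It uses all four iris measurements and base R only.

```r
data(iris)

# PCA on the four measurements, centred and scaled to unit variance.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of the total variance captured by each axis: how much the "photograph" retains.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# The 4-dimensional clouds, squished onto the first two axes.
plot(pca$x[, 1:2], col = iris$Species, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

The cumulative proportions show how much information is lost by keeping only two or three axes.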
R code:
rm(list=ls()) # wipe R's memory clean.
data(iris)
library(ggplot2)
# Note: the iris measurements are in centimetres.
# Density of sepal length per species (top panel of the figure).
g1 <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5) + theme_classic() +
  xlim(3, 9) + labs(x = "Sepal length (cm)", y = "Density")
# Density of petal length per species.
g2 <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) + theme_classic() +
  # coord_flip() +
  # scale_x_reverse() +
  xlim(0, 8) +
  # theme(legend.position = "none") +
  labs(x = "Petal length (cm)", y = "Density")
# Scatterplot of the two variables with 2-D density contours per species.
g3 <- ggplot(iris, aes(y = Petal.Length, x = Sepal.Length, color = Species)) +
  geom_point(alpha = 0.7, shape = 19, size = 3) +
  geom_density_2d(contour_var = "ndensity") + theme_bw() +
  ylim(0, 8) + xlim(3, 9) +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")
# combine plots:
ggpubr::ggarrange(g1, g3,
ncol = 1, nrow = 2
)
dev.new(); g2
# for the life of me I could not rotate g2 the right way round to produce the figure, so a bit of post-production in an image editing program was needed.