Principal Component Analysis (PCA) is an ordination method that, given N explanatory variables, creates a new set of N explanatory variables with two main characteristics: 1) they are all orthogonal to each other, and therefore uncorrelated, and 2) they are ranked by importance: the first PC is the one that explains the most variability, the Nth is the one that explains the least. Because of these features, PCA is sometimes used to reduce the dimensionality of multivariate data by keeping only the two or three PCs that explain the most variability.
It is also possible to determine the relative contribution of each of the original variables to each PC.
Here is a practical example on the “iris” dataset:
rm(list=ls()) # it's always good practice to clear R’s memory
data(iris)
#?iris # gives info about the dataset
str(iris)
head(iris)
plot(iris)
By plotting the variables against each other it becomes obvious that some are strongly correlated: in other words, some variables overlap in how much of the data variability they account for. A PCA will help disentangle these correlations.
# function prcomp() performs PCA:
fit <- prcomp(iris[-5], scale=TRUE)
Use scale=TRUE to standardize the variables to the same relative scale, preventing some variables from becoming dominant just because of their large measurement units.
The summary indicates that four PCs were created: the number of possible PCs always equals the number of original variables. The difference is that the principal components are all orthogonal to each other, and therefore uncorrelated (their collinearity is zero). PC1 and PC2 explain ~73% and ~23% of the data’s total variability respectively, summing to a more-than-respectable 96% of the total. There is no fixed rule about how many PCs to keep, but this already tells us that the other PCs can be ignored, as they explain only crumbs of the total data variability.
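The percentages quoted above can be reproduced directly from the fitted object. The following is a minimal sketch: it refits the model so that it runs on its own, and the names prop_var and cum_var are introduced here only for illustration.

```r
# Refit so the sketch is self-contained
fit <- prcomp(iris[-5], scale = TRUE)

# Standard report: standard deviation and proportion of variance per PC
summary(fit)

# The same proportions computed by hand: the variance of each PC is its
# squared standard deviation, divided here by the total variance
prop_var <- fit$sdev^2 / sum(fit$sdev^2)
cum_var <- cumsum(prop_var)

round(prop_var, 3)  # share of variability explained by each PC
round(cum_var, 3)   # running total across PC1..PC4
```

The cumulative share is a convenient quantity to look at when deciding how many PCs to keep.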
A “scree plot” allows a graphical assessment of the relative contribution of the PCs in explaining the variability of the data.
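One way to draw such a plot is base R's screeplot() function, which accepts a prcomp object. This is only a sketch of one possible call; the title text is an arbitrary choice.

```r
# Refit so the sketch is self-contained
fit <- prcomp(iris[-5], scale = TRUE)

# Scree plot: variance of each PC, in decreasing order
screeplot(fit, type = "lines", main = "Scree plot of the iris PCA")
```

A sharp drop followed by a flat tail suggests that the PCs after the "elbow" add little.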
Printing fit shows the “Rotation” matrix (fit$rotation), which contains the “loadings” that each of the original variables has on each of the newly created PCs. Understanding how the loadings are estimated, and more generally how the principal components are calculated, requires the concept of eigenvalue: the interested reader can look it up in the references.
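For readers who want a quick numerical check of the link with eigenvalues, here is a small sketch. It assumes the scaled PCA used above, in which case the PC variances should match the eigenvalues of the correlation matrix and the loadings should match its eigenvectors, up to an arbitrary sign flip per component. The name eig is introduced here for illustration.

```r
# Refit so the sketch is self-contained
fit <- prcomp(iris[-5], scale = TRUE)

# Loadings of the original variables on each PC
fit$rotation

# Eigen-decomposition of the correlation matrix of the four measurements
eig <- eigen(cor(iris[-5]))

# Eigenvalues correspond to the variances of the PCs
all.equal(eig$values, fit$sdev^2)

# Eigenvectors correspond to the loadings; signs may differ, so compare
# absolute values
all.equal(abs(eig$vectors), abs(unname(fit$rotation)))
```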
The arrows provide a graphical rendition of the loadings of each of the original variables on the plotted PCs.
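A plot with such arrows can be obtained with base R's biplot() function. This is a minimal sketch, and the cex value is just a guess at a readable label size.

```r
# Refit so the sketch is self-contained
fit <- prcomp(iris[-5], scale = TRUE)

# Biplot of PC1 vs PC2: observations are drawn as row labels,
# original variables as arrows
biplot(fit, cex = 0.6)
```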
How to make this PCA plot look better? Package ggplot2 and its derivative ggbiplot can help:
pacman::p_load(ggplot2)
# Variances of principal components:
variances <- data.frame(variances=fit$sdev**2, pcomp=1:length(fit$sdev)) # **2 means ^2
Plot of variances with ggplot:
# Plot of variances
varPlot <- ggplot(variances, aes(pcomp, variances)) +
  geom_bar(stat="identity", fill="gray") +
  geom_line()
varPlot
# install necessary packages, if they aren't already installed, and load them:
library(pacman)
p_load(devtools)
if (!require(ggbiplot)) install_github("vqv/ggbiplot")
library(ggbiplot)
Species <- iris$Species
iris_pca <- ggbiplot(fit, obs.scale=1, var.scale=1, groups=Species, ellipse=FALSE, circle=FALSE, varname.size=3)
# iris_pca <- iris_pca + theme_bw()
# try the above for a less ink-demanding background
iris_pca
We can see that petal width and length are highly correlated and that their variability across the three Iris species is accounted for mainly by PC1, which also explains a large part of the variability in sepal length. This would be good to know if we wanted to study the correlation between floral elements and some other variable, say water availability, or the size or relative abundance of a certain pollinator: instead of measuring the correlation between this variable and each of the measured response variables separately, we can simply use PC1 and PC2, which are synthetic and uncorrelated with each other.
Other uses of PCA include, in pollination ecology, the identification of floral syndromes, namely common sets of floral traits evolved to suit a certain group of pollinators: in our case study, the two PCs derived from the size of petals and sepals succeed in identifying the three Iris species, and PC1 alone is enough to discern Iris setosa from the other two species very clearly, possibly suggesting some kind of floral specialization that might be related to pollination strategies (e.g. avoiding shared pollinators with the other two species).
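To use the PCs as synthetic variables, the scores of each flower on each PC are available in fit$x. The sketch below extracts the first two, checks that they are uncorrelated, and summarizes them by species. The names scores and pc_cor are introduced here for illustration only.

```r
# Refit so the sketch is self-contained
fit <- prcomp(iris[-5], scale = TRUE)

# Scores of each observation on the first two PCs, with species attached
scores <- as.data.frame(fit$x[, 1:2])
scores$Species <- iris$Species

# PC1 and PC2 are uncorrelated (zero up to rounding error)
pc_cor <- cor(scores$PC1, scores$PC2)
pc_cor

# Range of PC1 scores per species, to see how the species spread along PC1
tapply(scores$PC1, scores$Species, range)

# Correlation between the original variables and the first two PCs
round(cor(iris[-5], fit$x[, 1:2]), 2)
```

The scores data frame can then be used as the set of variables in a later analysis, for example a regression against an environmental variable.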
- NOTE. Also see “An introduction to applied multivariate analysis with R” by Everitt and Hothorn, Springer 2011 (thanks to Torsten Hothorn for your help!) and https://onlinecourses.science.psu.edu/stat505/node/54 on the correlation between principal components and original variables.
- Gotelli and Ellison, “A Primer of Ecological Statistics”
- Manly, “Multivariate Statistical Methods”
- Pielou, “The Interpretation of Ecological Data”