# Some notes on Principal Component Analysis

Principal Component Analysis (PCA) is an ordination method that, given N explanatory variables, creates a new set of N variables (the principal components, PCs) with two main characteristics: 1) they are all orthogonal to each other, and therefore uncorrelated; and 2) they are ranked by importance: the first PC explains the most variability, the Nth explains the least. Because of these features, PCA is often used to reduce the dimensionality of multivariate data by keeping only the two or three PCs that explain the most variability.

It is also possible to determine the relative contribution of each of the original variables to each PC.

Here is a practical example on the “iris” dataset:

``````rm(list=ls()) # it's always good practice to clear R’s memory

data(iris)
#?iris # gives info about the dataset
str(iris)
plot(iris)``````

By plotting the variables against each other it becomes obvious that some are strongly correlated: in other words, there is an overlap in the power of some variables at explaining/accounting for the data variability. A PCA will help disentangling these correlations.
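These correlations can be quantified directly with base R's cor() before running the PCA:

``````
# correlation matrix of the four numeric variables (column 5 is Species):
round(cor(iris[-5]), 2)
# Petal.Length and Petal.Width are almost perfectly correlated (~0.96),
# and both correlate strongly with Sepal.Length
``````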

``````# the function prcomp() performs the PCA:
fit <- prcomp(iris[-5], scale = TRUE) # column 5 (Species) is excluded``````

Use scale = TRUE to standardize the variables to the same relative scale, preventing some variables from dominating just because of their larger measurement units.
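A quick comparison of the variance captured by PC1 with and without standardization illustrates the point (a small sketch, not part of the original code):

``````
unscaled <- prcomp(iris[-5], scale = FALSE)
scaled   <- prcomp(iris[-5], scale = TRUE)
# proportion of total variance explained by PC1 in each case:
summary(unscaled)$importance["Proportion of Variance", "PC1"] # ~0.92
summary(scaled)$importance["Proportion of Variance", "PC1"]   # ~0.73
# without scaling, Petal.Length (the variable with the largest variance)
# dominates the first component
``````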

``summary(fit)``

The summary indicates that four PCs were created: the number of possible PCs always equals the number of original variables. The difference is that the principal components are all orthogonal to each other, and therefore uncorrelated. PC1 and PC2 explain respectively ~73% and ~23% of the data’s total variability, summing to a more-than-respectable 96%. There is no fixed rule about this, but it already suggests that all the other PCs can be ignored, as they explain only crumbs of the total data variability.
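These percentages can be recomputed by hand from the standard deviations stored in fit, which shows where summary()'s numbers come from:

``````
fit <- prcomp(iris[-5], scale = TRUE) # as above
# each PC's variance is the square of its standard deviation:
pc_var <- fit$sdev^2
# proportion of the total variance explained by each PC:
round(pc_var / sum(pc_var), 4)
# PC1 ~0.73, PC2 ~0.23: the first two PCs cover ~96% of the variability
``````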

``plot(fit,type="lines")``

A “scree plot” allows a graphical assessment of the relative contribution of the PCs in explaining the variability of the data.

``fit ``

Printing fit shows the “Rotation” matrix, which contains the “loadings” that each of the original variables has on each of the newly created PCs. A quantitative understanding of how the principal components and their loadings are calculated requires introducing the concept of eigenvalues and eigenvectors: the interested reader can look it up in the references here and here.
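The loadings can also be extracted directly from fit$rotation; for example, the contributions of the original variables to PC1 (a self-contained sketch):

``````
fit <- prcomp(iris[-5], scale = TRUE) # as above
# one column of loadings per PC:
round(fit$rotation, 3)
# loadings on PC1 only:
fit$rotation[, "PC1"]
# the three size-related variables load on PC1 with the same sign, while
# Sepal.Width loads with the opposite sign (the signs themselves are
# arbitrary; only their relative pattern is meaningful)
``````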

``biplot(fit)``

The arrows provide a graphical rendition of the loadings of each of the original variables over the used PCs.
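The observation scores behind the biplot live in fit$x; plotting them on PC1 and PC2, colored by species, is a base-graphics alternative (a sketch, not part of the original tutorial):

``````
fit <- prcomp(iris[-5], scale = TRUE) # as above
# fit$x contains the coordinates (scores) of each observation on the PCs:
plot(fit$x[, "PC1"], fit$x[, "PC2"],
     col = iris$Species, pch = 19,
     xlab = "PC1 (~73% of variance)", ylab = "PC2 (~23% of variance)")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
``````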

How can this PCA plot be made to look better? The package ggplot2 and its derivative ggbiplot can help:

``````# install pacman if needed; p_load() installs and loads packages in one step:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2)
# variances of the principal components (square of the standard deviations):
variances <- data.frame(variances = fit$sdev^2, pcomp = 1:length(fit$sdev))``````

Plot of variances with ggplot:

``````# Plot of variances; note that each "+" must end its line, otherwise R
# treats the first line as a complete expression:
varPlot <- ggplot(variances, aes(pcomp, variances)) +
  geom_bar(stat = "identity", fill = "gray") +
  geom_line()
varPlot``````