Multiple linear regression is used when there are multiple explanatory variables and they are all continuous. Just like linear regression, ANOVA, and ANCOVA, it boils down to testing a general linear model, which in R is usually done with function lm().
rm(list=ls()) # make sure that R's memory is a clean slate. # Some data: DF<-data.frame( y=c(-1.22, -1.73, -2.64, -2.44, -1.11, 2.24, 3.42, 0.67, 0.59, -0.61, -10.77, 0.93, -8.6, -6.99, -0.12, -2.29, -5.16, -3.35, -3.35, -2.51, 2.21, -1.18, -5.21, -7.74, -1.34), x1=c(39.5, 41, 34, 30.5, 31.5, 30, 41.5, 24, 43, 39, 25.5, 38.5, 33.5, 30, 41, 31, 25, 37, 37.5, 24.5, 38, 37, 41, 37, 36), x2=c(61, 53, 53, 44, 49, 44, 57, 47, 54, 48, 46, 59, 46, 61, 55, 57, 59, 59, 55, 50, 62, 55, 55, 52, 55))
Explore the data graphically:
plot(DF) # install.packages("scatterplot3d") and load it all in once: if (!require(scatterplot3d)) install.packages('scatterplot3d'); library(scatterplot3d) # does what the name says s3d <-scatterplot3d(DF$x1,DF$x2,DF$y, xlab="x1", ylab="x2", zlab="y", pch=16, highlight.3d=T, type="p", main="3D Scatterplot") # it is possible to fit flat surfaces (the product of additive models)* with the code below: fit <- lm(y ~ x1+x2,data=DF) s3d$plane3d(fit,lty.box = "solid") # try to interpret the plot qualitatively.
Some more detail: plane3d can only plot flat surfaces, which is why I am using “fit” which is an additive model.
Adding an interaction between x1 and x2 in the model would mean allowing for x1 to affect the effect of x2 on y, and similarly for x2 on x1. This would result in a non-flat surface and plane3d would not handle it. There are tools for plotting non-flat surfaces in R… You can look them up online.
# the usual drill for selecting the best model:
linmod.full <- lm(y~x1*x2, data=DF)
linmod.add <- lm(y~x1+x2, data=DF)
linmod.x1 <- lm(y~x1, data=DF)
linmod.x2 <- lm(y~x2, data=DF)
linmod.null <- lm(y~1, data=DF)
We retain the null model. Variables x1 and x2, by theirselves or together, have no statistically significant explanatory power on y.
Let’s look at a more complicated situation:
dd <- longley # ?longley multilinmod <- lm(Employed ~ GNP + Unemployed + Armed.Forces + Population, data=dd)
This model has more than two explanatory variables, so we cannot represent it graphically.
IMPORTANT: all models have assumptions.
One can just assume that the assumptions are respected and move on (which is what we’ve been doing until now for simplicity), but it’s usually wiser to test those assumptions. If the data don’t respect the assumptions, one can transform the data or tweak the model to deal with it. One crucial assumption of multiple regression is that there should not be multi-collinearity between explanatory variables. What does that mean? For multiple linear regression to be used reliably, the explanatory variables should not be correlated to each other. Correlated explanatory variables are difficult to interpret in multiple regression analyses because they share explanatory power. Consequently, the perceived importance of one explanatory variable depends on other explanatory variables are included in the model.
# Evaluate Collinearity graphically explanatory <- dd[,c("GNP","Unemployed","Armed.Forces","Population")] plot(explanatory) #install.packages("GGally") library(GGally) # a little gimmick ggpairs(dd[,c("GNP","Unemployed","Armed.Forces","Population")]) # Evaluate Collinearity quantitatively: install.packages("mctest") library(mctest) # The ‘mctest’ package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity. Functions ‘omcdiag’ and ‘imcdiag’ provide the overall and individual diagnostic checking for multicollinearity respectively. # More here: https://www.r-bloggers.com/multicollinearity-in-r/ response <- dd[,c("Employed")] omcdiag(explanatory, response)
There is multicollinearity. How to fix this? To start off, one could (should?) prune those explanatory variables that are highly correlated with other variables, setting a threshold such as coeffs > 0.9. In our case, “GNP” and “population” have a correlation coefficient >0.9. Here I am overwriting object “explanatory” with a new one in which I arbitrarily dropped “population” (based on my limited knowledge of the dataset, retaining “population” and dropping “GNP” would have been equally valid):
explanatory <- dd[,c("GNP","Unemployed","Armed.Forces")]
Alternatively we can use Principal Component Analysis (PCA) to derive new variables that explain as much variability as the original ones, but they are by definition orthogonal, i.e. uncorrelated. These new variables are called Principal Components. There are as many Principal Components as the original variables. Principal Components are ranked according to how much variability they explain, so PC1 explains the most variability observed in the data.
More on PCA in general here (it includes a link to an excellent interactive explanation of how PCA works).
#Let's run a PCA: pca1 <- prcomp(explanatory, scale=TRUE) # scale=T deals with variables very different value ranges. See ?prcomp pca1 # the "Rotation" matrix # it contains the "loadings" of each of the original variables on the newly created PCs.
Loadings are proportional to the correlation between each of the original variables and each of the PC. More details here.
pca.scores <- pca1$x # the PCA scores, namely the new coordinates for each data point.# create a new dataset to run our analyses on: newvariables <- as.data.frame(cbind(response, pca.scores)) # test for multicollinearity omcdiag(pca.scores, response) # SOLVED! # how to select the best model? # one way is to compare models with all possible combinations of PC1,PC2 and PC3 - the usual drill: # multilinmod2 <- lm(response ~ PC1+PC2+PC3, data=newvariables) # multilinmod3 <- lm(response ~ PC1+PC2, data=newvariables) # multilinmod4 <- lm(response ~ PC1, data=newvariables) # multilinmod5 <- lm(response ~ PC2, data=newvariables) # et cetera... # anova(multilinmod2, multilinmod3) # et cetera... # there are ways to make this process automatic. One is called stepwise model selection - see: step()
Alternatively we can compare the amount of total variance explained by each PC and drop those that explain less than a certain threshold (5%? 10%?)
vars <- data.frame(variances= pca1$sdev^2, pcomp=1:length(pca1$sdev)) vars$percent.explained <- 100 * vars$variances/sum(vars$variances) vars # In this case we can make our life easier and keep only PC1 and PC2, and do model selection from there: multilinmod3 <- lm(response ~ PC1+PC2, data=newvariables) multilinmod4 <- lm(response ~ PC1, data=newvariables) multilinmod5 <- lm(response ~ PC2, data=newvariables) multilinmod6 <- lm(response ~ 1, data=newvariables) anova(multilinmod3, multilinmod4) # we can drop PC2 anova(multilinmod3, multilinmod5) # but keep PC1 summary(multilinmod4) # this seems the best model
To understand what the model actually says, look into the “Rotation” matrix (see above) and see how much each of the original variables contributes to PC1.