set.seed(100) # this is just a way to keep "randomly" generated numbers from changing every time we run the script g1<-rpois(n=100, lambda=2) g2<-rpois(n=100, lambda=7) trcol <- adjustcolor(c("red", "blue"), alpha=0.4) hist(g1, freq=T, xlim=c(0,17), col= trcol, breaks=5, main="Count data", xlab="N") hist(g2, freq=T, add=T, col= trcol, breaks=15)
The frequency distributions of the two groups are asymmetrical, a clue that they are not normal.
# Let's run some analyses # organise data in a dataset: dd <- data.frame( group = c(rep("group1", 100),rep("group2", 100)), values = c(g1, g2) ) head(dd)
Let’s play dumb and use a general linear model approach:
lm1 <- lm(values~group, data=dd) summary(lm1) # check the model assumptions (graphically): par(mfrow=c(3,2)) plot(lm1)
- Residuals vs fitted. This plot shows if residuals have non-linear patterns, but it also gives an indication on whether data are homoskedastic, i.e. whether mean and variance are unrelated. In our case variability increases with the mean: this pattern is called heterosckedasticity, and it violates one of the assumptions that data must observe to be analysed with general linear models.
- Q-Q plot. This plot shows if residuals are normally distributed. In our case they aren’t. For interpreting Q-Q plots see here and here – many other examples online and on books.
- Scale-Location: to check the assumption of equal variance (homoscedasticity). For the data in exam, Scale-Location shows that the larger the fitted values, the more the variability around them. In other words, mean and variance are correlated.
- Residuals vs Leverage. Are there any outliers? Not in our case there might be three.
- Extra aid for evaluating the distribution of residuals:
How to deal with non-normality? One way is to normalise the data by transforming them:
dd$logvals <- log(dd$values) lm2 <- lm(logvals ~ group, data=dd)
Here is one problem with log-transformations: values=0 need to be made arbitrarily different from zero, hence adding an extra transformation for our data:
dd$logvals2 <- log(dd$values+1) par(mfrow=c(1,2)) hist(dd[dd$group=="group1",]$logvals2, freq=T, xlim=c(-2,4), ylim=c(0,60), col= trcol, breaks=5, xlab="log(count data)", main="" ) hist(dd$logvals2[which(dd$group=="group2")], freq=T, add=T, col= trcol, breaks=4) hist(dd[dd$group=="group1",]$logvals2, freq=T, xlim=c(-2,4), ylim=c(0,60), col= trcol, breaks=5, xlab="log(count data+1)", main="" ) hist(dd$logvals2[which(dd$group=="group2")], freq=T, add=T, col= trcol, breaks=4)
# Now lm() runs. lm3 <- lm(logvals2 ~ group, data=dd) summary(lm3) # Note that it gives log(estimates) # estimates = exp(log(estimates))
Because of the issues above, some have suggested: http://www.imachordata.com/do-not-log-transform-count-data-bitches/
Use GENERALISED linear models instead!
glm1 <- glm(values ~ group, data=dd, family="poisson"(link = "log")) summary(glm1)
The model estimates both the parameters AND their associated Std. Errors on the link scale (in our example, link=log), and the statistical tests are run assuming that the residuals follow not a Normal, but a Poisson distribution.
To back-transform the parameters:
mean1 <- exp(summary(glm1)$coefficients[1,1]) mean2 <- exp(summary(glm1)$coefficients[1,1]) + exp(summary(glm1)$coefficients[2,1])
And to obtain the Std.Error on the original scale? Just doing:
exp(summary(glm1)$coefficients[,2]) # WRONG!
Is not correct (to find out why, this is a good starting point).
One way is to use function
preddata <- data.frame(group=unique(dd$group)) preds <- predict(glm1, newdata=preddata, type="response", se.fit=T) preds$se.fit
For a detailed comparison of four different ways to calculate standard errors and plot 95% confidence intervals, click here.
Did you enjoy this? Consider joining my on-line course “First steps in data analysis with R” and learn data analysis from zero to hero!