# Comparing two groups of count data

``````set.seed(100) # this is just a way to keep "randomly" generated numbers from changing every time we run the script
g1<-rpois(n=100, lambda=2)
g2<-rpois(n=100, lambda=7)
hist(g1, freq=T, xlim=c(0,17),
col= trcol, breaks=5,
main="Count data", xlab="N")
hist(g2, freq=T, add=T, col= trcol, breaks=15)``````

The frequency distributions of the two groups are asymmetrical, a clue that they are not normal.

``````# Let's run some analyses
# organise data in a dataset:
dd <- data.frame(
group = c(rep("group1", 100),rep("group2", 100)),
values = c(g1, g2)
)

Let’s play dumb and use a general linear model approach:

``````lm1 <- lm(values~group, data=dd)
summary(lm1)
# check the model assumptions (graphically):
par(mfrow=c(3,2))
plot(lm1)``````

Diagnostics plots:

1. Residuals vs fitted. This plot shows if residuals have non-linear patterns, but it also gives an indication on whether data are homoskedastic, i.e. whether mean and variance are unrelated. In our case variability increases with the mean: this pattern is called heterosckedasticity, and it violates one of the assumptions that data must observe to be analysed with general linear models.
2. Q-Q plot. This plot shows if residuals are normally distributed. In our case they aren’t. For interpreting Q-Q plots see here and here – many other examples online and on books.
3. Scale-Location: to check the assumption of equal variance (homoscedasticity). For the data in exam, Scale-Location shows that the larger the fitted values, the more the variability around them. In other words, mean and variance are correlated.
4. Residuals vs Leverage. Are there any outliers? Not in our case there might be three.
5. Extra aid for evaluating the distribution of residuals:
``hist(lm1\$resid)``

How to deal with non-normality? One way is to normalise the data by transforming them:

``````dd\$logvals <- log(dd\$values)
lm2 <- lm(logvals ~ group, data=dd)``````

Here is one problem with log-transformations: values=0 need to be made arbitrarily different from zero, hence adding an extra transformation for our data:

``````dd\$logvals2 <- log(dd\$values+1)
par(mfrow=c(1,2))
hist(dd[dd\$group=="group1",]\$logvals2, freq=T,
xlim=c(-2,4), ylim=c(0,60),
col= trcol, breaks=5,
xlab="log(count data)", main=""
)

hist(dd\$logvals2[which(dd\$group=="group2")], freq=T, add=T, col= trcol, breaks=4)
hist(dd[dd\$group=="group1",]\$logvals2, freq=T,
xlim=c(-2,4), ylim=c(0,60),
col= trcol, breaks=5,
xlab="log(count data+1)", main=""
)
hist(dd\$logvals2[which(dd\$group=="group2")], freq=T, add=T, col= trcol, breaks=4)``````
``````# Now lm() runs.
lm3 <- lm(logvals2 ~ group, data=dd)
summary(lm3)
# Note that it gives log(estimates)
# estimates = exp(log(estimates))``````

Because of the issues above, some have suggested: http://www.imachordata.com/do-not-log-transform-count-data-bitches/

``````glm1 <- glm(values ~ group, data=dd, family="poisson"(link = "log"))
summary(glm1)``````

The model estimates both the parameters AND their associated Std. Errors on the link scale (in our example, link=log), and the statistical tests are run assuming that the residuals follow not a Normal, but a Poisson distribution.

To back-transform the parameters:

``````mean1 <- exp(summary(glm1)\$coefficients[1,1])
mean2 <- exp(summary(glm1)\$coefficients[1,1]) + exp(summary(glm1)\$coefficients[2,1])``````

And to obtain the Std.Error on the original scale? Just doing:

``exp(summary(glm1)\$coefficients[,2]) # WRONG!``

Is not correct (to find out why, this is a good starting point).

One way is to use function `predict()`:

``````preddata <- data.frame(group=unique(dd\$group))
preds <- predict(glm1, newdata=preddata, type="response", se.fit=T)
preds\$se.fit``````

For a detailed comparison of four different ways to calculate standard errors and plot  95% confidence intervals, click here.

Did you enjoy this? Consider joining my on-line course “First steps in data analysis with R” and learn data analysis from zero to hero!