Somewhat surprising: we use variances to compare mean values.
Explanation: differences of means contribute to the total variance of the whole sample.
Variance components can be called variance within (\(s^2_\varepsilon\)) and variance between samples.
The way how to separate variances is a linear model.
Example
Two brands of Clementine fruits from a shop “E”, that we encode as “EB” and “EP”. We want to know whether the premium brand (“P”) and the basic brand (“B”) have a different weight.
plot(weight ~ code, data = clem, axes =FALSE)m <-lm(weight ~ code, data = clem)axis(1, at =c(1,2), labels =c("EB", "EP")); axis(2); box()abline(m, col ="blue")
Variance components
We fit a linear model and compare the variances:
m <-lm(weight ~ code, data = clem)
total variance
(var_tot <-var(clem$weight))
[1] 68.98814
residual variance (= within variance)
(var_res <-var(residuals(m)))
[1] 43.25
explained variance (= between variance)
var_tot - var_res
[1] 25.73814
Now we can analyse whether the between variance is big enough to justify a significant effect.
This is called an ANOVA.
ANOVA
anova(m)
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
code 1 566.24 566.24 12.497 0.001963 **
Residuals 21 951.50 45.31
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A t-test for comparison
t.test(weight ~ code, data = clem, var.equal=TRUE)
Two Sample t-test
data: weight by code
t = 3.5351, df = 21, p-value = 0.001963
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
4.185911 16.147423
sample estimates:
mean in group 1 mean in group 2
95.50000 85.33333
\(\Rightarrow\) the p-values are exactly the same.
ANOVA with more than 2 samples
Back to the algae growth data. Let’s call the linear model m:
m <-lm(growth ~ treat, data = algae)
We can print the coefficients of the linear model with summary(m)
But we are interested in the overall effect and use anova
anova(m)
Analysis of Variance Table
Response: growth
Df Sum Sq Mean Sq F value Pr(>F)
treat 6 2.35441 0.39240 25.045 1.987e-06 ***
Residuals 13 0.20368 0.01567
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table shows F-tests testing for significance of all factors.
In the table above, we have only one single factor.
\(\Rightarrow\) We see that the treatment had a significant effect.
Posthoc tests
The test showed, that the factor “treatment” had a significant effect.
We don’t know yet, which factor levels were different.
Tukey HSD test is the most common.
tk <-TukeyHSD(aov(m))tk
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = m)
$treat
diff lwr upr p adj
F. open-Fertilizer 0.91500000 0.56202797 1.26797203 0.0000103
F.+sugar-Fertilizer 0.19266667 -0.16030537 0.54563870 0.5211198
F.+CaCO3-Fertilizer 0.46900000 0.11602797 0.82197203 0.0069447
Bas.med.-Fertilizer 0.85066667 0.49769463 1.20363870 0.0000231
A.dest-Fertilizer 0.15000000 -0.20297203 0.50297203 0.7579063
Tap water-Fertilizer 0.13666667 -0.25796806 0.53130140 0.8837597
F.+sugar-F. open -0.72233333 -1.07530537 -0.36936130 0.0001312
F.+CaCO3-F. open -0.44600000 -0.79897203 -0.09302797 0.0102557
Bas.med.-F. open -0.06433333 -0.41730537 0.28863870 0.9943994
A.dest-F. open -0.76500000 -1.11797203 -0.41202797 0.0000721
Tap water-F. open -0.77833333 -1.17296806 -0.38369860 0.0001913
F.+CaCO3-F.+sugar 0.27633333 -0.07663870 0.62930537 0.1727182
Bas.med.-F.+sugar 0.65800000 0.30502797 1.01097203 0.0003363
A.dest-F.+sugar -0.04266667 -0.39563870 0.31030537 0.9994197
Tap water-F.+sugar -0.05600000 -0.45063473 0.33863473 0.9985686
Bas.med.-F.+CaCO3 0.38166667 0.02869463 0.73463870 0.0307459
A.dest-F.+CaCO3 -0.31900000 -0.67197203 0.03397203 0.0879106
Tap water-F.+CaCO3 -0.33233333 -0.72696806 0.06230140 0.1247914
A.dest-Bas.med. -0.70066667 -1.05363870 -0.34769463 0.0001792
Tap water-Bas.med. -0.71400000 -1.10863473 -0.31936527 0.0004507
Tap water-A.dest -0.01333333 -0.40796806 0.38130140 0.9999997
Graphical output
par(las =1) # las = 1 make y annotation horizontalpar(mar =c(4, 10, 3, 1)) # more space at the left for axis annotationplot(tk)
ANOVA assumptions and diagnostics
ANOVA has same assumptions as the linear model.
Independence of errors
Variance homogeneity
Approximate normality of errors
Graphical checks are preferred.
par(mfrow=c(2, 2))plot(m)
Numerical tests
Test variance homogeneity
F-test compares only two variances.
Several tests for multiple variances, e.g. Bartlett, Levene, Fligner-Killeen
Recommended: Fligner-Killeen-test
fligner.test(growth ~ treat, data = algae)
Fligner-Killeen test of homogeneity of variances
data: growth by treat
Fligner-Killeen:med chi-squared = 4.2095, df = 6, p-value = 0.6483
Test of normal distribution
The Shapiro-Wilks test can be misleading.
Use a graphical method!
qqnorm(residuals(m))qqline(residuals(m))
One-way ANOVA with heterogeneous variances
extension of the Welch test for \(\ge 2\) samples
in R called oneway.test
oneway.test(growth ~ treat, data = algae)
One-way analysis of means (not assuming equal variances)
data: growth and treat
F = 115.09, num df = 6.0000, denom df = 4.6224, p-value = 6.57e-05
Two-way ANOVA
Example from a statistics text book (Crawley, 2002), applied to a new context
Effects of fertilizer and light regime on growth of plant height in cm per time
fertilizer
high light
low light
A
8.3
6.6
A
8.7
7.2
B
8.1
6.9
B
8.5
8.3
C
9.1
7.9
C
9.0
9.2
factorial experiment (with replicates): each factor combination has more than one observation.
without replication:
no replicates per factor combination
this is possible, but does not allow identification of interactions
The following shows the raw p-values without correction:
c(p1, p2, p3)
[1] 0.576275261 0.027378832 0.001190592
… and with Holm correction:
p.adjust(c(p1, p2, p3))
[1] 0.576275261 0.054757664 0.003571775
Conclusions
Statistical methods
In case of Holm-corrected t-tests, only a single p-value (MCYST vs. Subst A) remains significant. This indicates that in this case, Holm’s method is more conservative than TukeyHSD (only one compared to two significant) effects.
An ANOVA with posthoc test is in general preferred,
but the sequential Holm-Bonferroni can be helpful in special cases.
Moreover, it demonstrates clearly that massive multiple testing needs to be avoided.
\(\Rightarrow\) ANOVA is to be preferred, when possible.
Interpretation
Regarding our original hypothesis, we can see that MCYST and SubstA did not inhibit growth of Pseudanabaena. In fact SubstA stimulated growth.
This was contrary to our expectations – the biological reason was then found 10 years later.
More about this can be found in Jähnichen et al. (2001), Jähnichen et al. (2007), Jähnichen et al. (2011), Zilliges et al. (2011) or Dziallas & Grossart (2011).
ANCOVA
Statistical question
Comparison of regression lines
Similar to ANOVA, but contains also metric variables (covariates)
Example
Annette Dobson’s birthweight data. A data set from a statistics textbook (Dobson, 2013), birth weight of boys and girls in dependence of the pregnancy week.
The birthweight data set
The data set is found at different places on the internet and in different versions.
Here the version that is found in an R demo: demo(lm.glm)
t.test(weight ~ gender, data = dobson, var.equal =TRUE)
Two Sample t-test
data: weight by gender
t = 0.97747, df = 22, p-value = 0.339
alternative hypothesis: true difference in means between group M and group F is not equal to 0
95 percent confidence interval:
-126.3753 351.7086
sample estimates:
mean in group M mean in group F
3024.000 2911.333
The box plot shows much overlap and the difference is not significant, because the t-test ignores important information: the pregnancy week.
ANCOVA makes use of covariates
m <-lm(weight ~ week * gender, data = dobson)anova(m)
use of a one-way ANOVA for uneaqual variances (in R: oneway.test)
Unbalanced case: unequal number of samples for each factor combination \(\rightarrow\) ANOVA results depend on the order of factors in the model formula.
Classical method: Type II or Type III ANOVA
Modern approach: model selection and likelihood ratio tests
Type II and type III ANOVA
function Anova (with upper case A) in package car
Help file of function Anova:
“Type-II tests are calculated according to the principle of marginality, testing each term after all others, except ignoring the term’s higher-order relatives; so-called type-III tests violate marginality, testing each term in the model after all of the others.”
Conclusion: Use type II and not type III.
Don’t try to interpret single terms in case of significant interactions.
Type II ANOVA: Example
library("car")m <-lm(growth ~ light * fert, data = plants)Anova(m, type="II")
Model with the smallest AIC \(\rightarrow\) optimal model.
anova(opt)
Analysis of Variance Table
Response: growth
Df Sum Sq Mean Sq F value Pr(>F)
fert 2 2.6600 1.33000 3.7246 0.07190 .
light 1 2.6133 2.61333 7.3186 0.02685 *
Residuals 8 2.8567 0.35708
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p < 0.05 \(\rightarrow\) significant
Results of the example:
optimal model (m2, opt), includes both factors fert and light but no interaction.
\(\Rightarrow\) Model selection identified fert and light as necessary explanatory variables, in contrast to the classical ANOVA table where only light is significant.
Significance tests?
The concept of model selection supercedes p-value-based statistics.
Some authors generally discourage to use p-values in this context, others recommend a compromize.
If you want to get a p-value, compare the optimal model with further reduced models, but still interpret the p-values with care:
anova(m2, m1a) # fertanova(m2, m1b) # light
In any case: focus on practical implications and don’t forget to report the effect sizes!
Summary of the ANOVA chapter
Linear models form the basis of many statistical methods
Linear regression
ANOVA, ANCOVA, GLM, GAM, GLMM, . . .
ANOVA/ANCOVA instead of multiple testing
ANOVA is more powerful than multiple tests:
avoids \(\alpha\)-error inflation
one big experiment needs less n than many small experiments
identification of interaction effects
elimination of co-variates
Model selection vs. p-value based testing
paradigm shift in statistics: AIC instead of p-value
more reliable, especially for imbalanced or complex designs
extensible to generalized, additive, and mixed models (GLM, GAM, LME, GLMM, …)
but: p-value based tests are sometimes easier to understand
Avoid p-value hacking
Do NOT repeat experiments until a significant p-value is found.
The high-ranked journal “… Nature asked influential statisticians to recommend one change to improve science. The common theme? The problem is not our maths, but ourselves.” (Leek et al. (2017)):
Five ways to fix statistics. Comment on Nature
Jeff Leek: Adjust for human cognition
Blakeley B. McShane & Andrew Gelman: Abandon statistical significance
David Colquhoun: State false-positive risk, too
Michèle B. Nuijten: Share analysis plans and results
Steven N. Goodman: Change norms from within
Self study
Read the paper of Johnson & Omland (2004) to gain more understanding of the model selection paradigm.
Dobson, A. J. (2013). Introduction to statistical modelling. Springer.
Dziallas, C., & Grossart, H.-P. (2011). Increasing Oxygen Radicals and Water Temperature Select for Toxic Microcystis sp. PLoS ONE, 6(9), e25569. https://doi.org/10.1371/journal.pone.0025569
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 65–70. https://www.jstor.org/stable/4615733
Jähnichen, S., Ihle, T., Petzoldt, T., & Benndorf, J. (2007). Impact of Inorganic Carbon Availability on Microcystin Production by Microcystis aeruginosa PCC 7806. Applied and Environmental Microbiology, 73(21), 6994–7002. https://doi.org/10.1128/AEM.01253-07
Jähnichen, S., Long, B. M., & Petzoldt, T. (2011). Microcystin production by Microcystis aeruginosa: Direct regulation by multiple environmental factors. Harmful Algae, 12, 95–104. https://doi.org/10.1016/j.hal.2011.09.002
Jähnichen, S., Petzoldt, T., & Benndorf, J. (2001). Evidence for control of microcystin dynamics in Bautzen Reservoir (Germany) by cyanobacterial population growth rates and dissolved inorganic carbon. Fundamental and Applied Limnology, 150(2), 177–196. https://doi.org/10.1127/archiv-hydrobiol/150/2001/177
Johnson, G., Jerald, & Omland, K. S. (2004). Model Selection in Ecology and Evolution. Trends in Ecology and Evolution, 19(2), 101–108. https://doi.org/10.1016/j.tree.2003.10.013
Zilliges, Y., Kehr, J.-C., Meissner, S., Ishida, K., Mikkat, S., Hagemann, M., Kaplan, A., Börner, T., & Dittmann, E. (2011). The Cyanobacterial Hepatotoxin Microcystin Binds to Proteins and Increases the Fitness of Microcystis under Oxidative Stress Conditions. PLoS ONE, 6(3), e17615. https://doi.org/10.1371/journal.pone.0017615