05-Classical Tests

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-11-20

Hypotheses, errors and the p-value

Statistical test

A statistical hypothesis test is a method of statistical inference.

Commonly, two samples are compared, or a sample is compared against properties from an idealized model.
A hypothesis \(H_a\) for the statistical relationship between the two data sets, is compared to an idealized null hypothesis \(H_0\) that proposes no relationship between two data sets.
The comparison is considered statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis according to a threshold probability – the significance level.

adapted from: https://en.wikipedia.org/wiki/Statistical hypothesis testing

Effect size and significance

In case of relative mean differences, the relative effect size is:

\[ \delta = \frac{\bar{\mu}_1-\bar{\mu}_2}{\sigma}=\frac{\Delta}{\sigma} \]

with:

mean values of two populations \(\mu_1, \mu_2\)
effect size \(\Delta\)
relative effect size \(\delta\) (also called Cohen’s d)
significance means that an observed effect is unlikely the result of pure random variation.

Null hypothesis and alternative hypothesis

\(H_0\) null hypothesis: two populations are not different with respect to a certain property.

Assumption: observed effect occured purely at random, true effect is zero.

\(H_a\) alternative hypothesis (experimental hypothesis): existence of a certain effect.

An alternative hypothesis is never completely true or “proven”.
Acceptance of \(H_A\) means only than \(H_0\) is unlikely.

“Not significant” means either no effect or sample size too small!

Note: Different meaning of significance (\(H_0\) unlikely) and relevance (effect large enough to play a role in practice).

The p-value

The interpretation of the p-value was often confused in the past, even in statistics textbooks, so it is good to refer to a clear definition:

The p-value is defined as the probability of obtaining a result equal to or ‘more extreme’ than what was actually observed, when the null hypothesis is true.

https://en.wikipedia.org/wiki/P-value:
Hubbard (2004) Alphabet Soup: Blurring the Distinctions Between p’s and a’s in Psychological Research, Theory Psychology 14(3), 295-327. DOI: 10.1177/0959354304043638

Alpha and beta errors

Reality	Decision of the test	correct?	probability
\(H_0\) = true	significant	no	\(\alpha\)-error
\(H_0\) = false	not significant	no	\(\beta\)-error
\(H_0\) = true	not significant	yes	\(1-\alpha\)
\(H_0\) = false	significant	yes	\(1-\beta\) (power)

1.\(H_0\) falsely rejected (error of the first kind or \(\alpha\)-error)

we claim an effect, that does not exist, e.g. a drug with no effect

2.\(H_0\) falsely retained (error of the second kind or \(\beta\)-error)

typical case in small studies, where effect was not enough to detect existing effects

Use in practice

common convention in environmental sciences: \(\alpha=0.05\), must be set beforehand
\(\beta=f(\alpha, \text{effectsize}, \text{sample size}, \text{kind of test})\), should be \(\le 0.2\)

Significance and relevance

Significance is not the only important. Focus also on effect size and relevance!

Statistical significance means that the null hypothesis \(H_0\) is unlikely in a statistical sense.
Practical relevance (sometimes called “practical significance”) means that the effect size is large enough to play a role in practice.

This means that whether an effect can be relevant or not depends on its effect size and the field of application.

Let’s for example consider a vaccination. If a vaccine had a significant effect in a clinical test, but protected only 10 out of 1000 people, one would not consider this effect as relevant and not produce this vaccine.

On the other hand, even small effects can be relevant. So if a toxic substance would have an effect on 1 out of 1000 people to produce cancer, we would consider this as relevant. To detect this as a significant effect would need an epidemiological study with a large number of people. But as it is highly relevant, it is worth the effort.

Take home messages

A p-value measures the probability that a purely random effect would be equally or more extreme than an observed effect if the null hypothesis is true.
Significant means the results are unlikely if there were no real effect.
Not significant doesn’t mean “no effect”.
Non-significant results suggest the need for further research, e.g.:
- increase sample size
- increase experimental effect
- reduce experimental error
- consider a more powerful statistical procedure
Don’t focus on p-values alone. Never forget to report also sample size, effect size and relevance of your results.
With large datasets:
- statistically significant results can easily be obtained even for very small and practically irrelevant effects.
- \(\rightarrow\) effect size and relevance become more important than p-values.

The p-value remains an important tool in statistics, but misuse can lead to misinterpretation.

Differences between mean values

One sample t-Test

tests if a sample is from a population with given mean value \(\mu\)
based on checking if the population mean \(\mu\) is in the confidence interval of \(\bar{x}\)

Let’s assume a sample of size with \(n=10, \bar{x}=5.5, s=1\) and \(\mu=5\).
Estimate the 95% confidence interval of \(\bar{x}\):

\[ CI = \bar{x} \pm t_{1-\alpha/2, n-1} \cdot s_{\bar{x}} \] with \[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad \text{(standard error)} \]

Different ways of calculation shown at the next slides

Remember: standard deviation and standard error

Visualization of a one-sample t-test. Left: original distribution of the data measured by standard deviation, right: distribution of mean values, measured by its standard error.

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad \text{(standard error)} \]

standard error < standard deviation
measures precision of the mean value
CLT!

The test is based on the distribution of the means, not distribution of original data.

Method 1: Is \(\mu\) in the confidence interval?

Sample: \(n=10, \bar{x}=5.5, s=1\) and \(\mu=5\)
Let \(\alpha = 0.05\), we get a two-sided 95% confidence interval with:

\[\bar{x} \pm t_{0,975, n-1} \cdot \frac{s}{\sqrt{n}}\]

5.5 + c(-1, 1) * qt(0.975, 10-1) * 1/sqrt(10)

[1] 4.784643 6.215357

Check if \(\mu=5.0\) is in this interval?
Yes, it is inside \(\Rightarrow\) difference not significant.

Method 2: Comparison with a tabulated t-value

Rearrange the equation of the confidence interval, to calculate an observed \(t_{obs}\)

\[ t_{obs} = |\bar{x}-\mu | \cdot \frac{1}{s_{\bar{x}}} = \frac{|\bar{x}-\mu |}{s} \cdot \sqrt{n} = \frac{|5.5 -5.0|}{1.0} \cdot \sqrt{10} \]

We can calculate this in R:

t <- abs(5.5 - 5.0) / 1.0 * sqrt(10)
t

[1] 1.581139

Compare \(t_{obs}\) with a tabulated value

“Old style”: find critical t-value in a table for given \(\alpha\) and degrees of freedom (\(n-1\))
For \(\alpha=0.05\) and two-sided, this is: \(t_{1-\alpha/2, n-1} = 2.26\).

Comparison: \(1.58 < 2.26\) \(\Rightarrow\) no significant difference between \(\bar{x}\) and \(\mu\).

Method 3: Calculation of the p-value from \(t_{obs}\)

use computerized probability function (pt) instead of table lookup
\(t = t_{obs}\) and the degrees of freedom (\(n-1\)):

2 * (1 - pt(t, df = 10 - 1)) # 2 * (1 - p) is re-arranged from 1-alpha/2

[1] 0.1483047

This p-value = 0.1483047 is greater than \(0.05\) so we consider the difference as not significant.

FAQ: less than or greater than?

p-value	\(\text{p-value} < \alpha\)	null hypothesis unlikely	significant
test statistic	\(t_{obs} > t_{1-\alpha/2, n-1}\)	effect exceeds confint.	significant

Method 4: Built-in t-test function in R

The same can be done much easier with the computer in R.

Let’s assume we have a sample with \(\bar{x}=5, s=1\):

## define sample
x <- c(5.5, 3.5, 5.4, 5.3, 6, 7.2, 5.4, 6.3, 4.5, 5.9)

## perform one-sample t-test
t.test(x, mu=5)


    One Sample t-test

data:  x
t = 1.5811, df = 9, p-value = 0.1483
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.784643 6.215357
sample estimates:
mean of x 
      5.5

The test returns the observed t-value, the 95% confidence interval and the p-value.

An important difference is, that this method needs the original data, while the other methods need only mean, standard deviation and sample size.

Two sample t-test

The two-sample t-test compares two independent samples:

x1 <- c(5.3, 6.0, 7.1, 6.4, 5.7, 4.9, 5.0, 4.6, 5.7, 4.0, 4.5, 6.5)
x2 <- c(5.8, 7.1, 5.8, 7.0, 6.7, 7.7, 9.2, 6.0, 7.2, 7.8, 7.8, 5.7)
t.test(x1, x2)


    Welch Two Sample t-test

data:  x1 and x2
t = -3.7185, df = 21.611, p-value = 0.001224
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.3504462 -0.6662205
sample estimates:
mean of x mean of y 
 5.475000  6.983333

\(\rightarrow\) both samples differ significantly (\(p < 0.05\))
Note: R has not performed the “ordinary” t-test but the Welch test (= heteroscedastic t-test)
where variances of both samples don’t need to be identical.

Hypothesis and formula of the two-sample t-test

\(H_0\) \(\mu_1 = \mu_2\)

\(H_a\) the two means are different

test criterion

\[ t_{obs} =\frac{|\bar{x}_1-\bar{x}_2|}{s_{tot}} \cdot \sqrt{\frac{n_1 n_2}{n_1+n_2}} \]

pooled standard deviation

\[ s_{tot} = \sqrt{{({n}_1 - 1)\cdot s_1^2 + ({n}_2 - 1)\cdot s_2^2 \over ({n}_1 + {n}_2 - 2)}} \]

assumptions: independence, equal variances, approximate normal distribution

The Welch test

Known as t-test for samples with unequal variance, works also for equal variance!

Test criterion:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2_{\bar{x}_1} + s^2_{\bar{x}_2}}} \]

Standard error of each sample:

\[ s_{\bar{x}_i} = \frac{s_i}{\sqrt{n_i}} \] Corrected degrees of freedom:

\[ \text{df} = \frac{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}{\frac{s^4_1}{n^2_1(n_1-1)} + \frac{s^4_2}{n^2_2(n_2-1)}} \]

Welch test in R

t.test(x1, x2)


    Welch Two Sample t-test

data:  x1 and x2
t = -3.7185, df = 21.611, p-value = 0.001224
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.3504462 -0.6662205
sample estimates:
mean of x mean of y 
 5.475000  6.983333

… is just the default method of the t.test-function.

Equality of variance: F-test

\(H_0\): \(\sigma_1^2 = \sigma_2^2\)

\(H_a\): variances unequal

Test criterion:

\[F = \frac{s_1^2}{s_2^2} \]

larger of the two variances in the enumerator \((s^2_1 > s^2_2)\)
separate degrees of freedom (\(n-1\))

Example:

\(s_1=1\), \(s_2 =2\), \(n_1=5, n_2=10, F=\frac{2^2}{1^2}=4\)
deg. of freedom: \(9 \atop 4\)

\(\Rightarrow\) \(F_{9, 4, \alpha=0.975} = 8.9 > 4 \quad\rightarrow\) not significant

Homogeneity of variances with > 2 samples

Bartlett’s test:

bartlett.test(list(x1, x2, x3))


    Bartlett test of homogeneity of variances

data:  list(x1, x2, x3)
Bartlett's K-squared = 7.7136, df = 2, p-value = 0.02114

Fligner-Killeen test (recommended):

fligner.test(list(x1, x2, x3))


    Fligner-Killeen test of homogeneity of variances

data:  list(x1, x2, x3)
Fligner-Killeen:med chi-squared = 2.2486, df = 2, p-value = 0.3249

tests are often used to check assumptions of the ANOVA

Recommendation for two sample t-tests

Traditional procedure:

Test for equal variances using the F-test: var.test(x, y)
If variances are equal: t.test(x, y, var.equal=TRUE)
otherwise, use t.test(x, y) (= Welch test)
Check if both samples follow a normal distribution.

Modern recommendation (preferred):

Don’t use pre-tests!
Always use the Welch test: t.test(x, y)
Check approximate normal distribution with box- or QQ-plots. Less important if \(n\) is large.

see Zimmerman (2004) or Wikipedia.

Paired t-test

sometimes also called “t-test of dependent samples”
- the term “dependent” can be misleading, “paired” is clearer
- values within samples must still be independent
examples: left arm / right arm; before / after
is essentially a one-sample t-test of pairwise differences against \(\mu=0\)
reduces the influence of individual differences (“covariates”) by focusing on the change within each pair

x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, var.equal=TRUE)


    Two Sample t-test

data:  x1 and x2
t = -1.372, df = 8, p-value = 0.2073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.28924  1.08924
sample estimates:
mean of x mean of y 
      4.0       5.6

p=0.20, not significant

x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, paired=TRUE)


    Paired t-test

data:  x1 and x2
t = -4, df = 4, p-value = 0.01613
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.710578 -0.489422
sample estimates:
mean difference 
           -1.6

p=0.016, significant

It can be seen that the paired t-test has a greater discriminatory power in this case.

Mann-Whitney and Wilcoxon-test

Non-parametric tests:
- No assumptions about shape and parameters of distribution, but
- distributions should be similar, otherwise test may be misleading.
Based on Ranks: Tests compare the ranks of the data.
Use Mann-Whitney for independent samples, Wilcoxon for paired samples.

Basic principle: Count of so-called “inversions” of ranks, where samples overlap

Sample A: 1, 3, 4, 5, 7
Sample B: 6, 8, 9, 10, 11
Both samples ordered together: 1, 3, 4, 5, 6, 7, 8, 9, 10, 11
Inversions: \(\rightarrow\) \(U = 1\)

Mann-Whitney test procedure in practice

Assign ranks \(R_A\) and \(R_B\) to both samples \(A\), and \(B\) with sample size \(m\) and \(n\).
Calculate number of inversions \(U\):

\[\begin{align*} U_A &= m \cdot n + \frac{m (m + 1)}{2} - \sum_{i=1}^m R_A \\ U_B &= m \cdot n + \frac{n (n + 1)}{2} - \sum_{i=1}^n R_B \\ U &= \min(U_A, U_B) \end{align*}\]

Critical values of \(U\) can be found in common statistics text books.
Not necessary in R, p-value directly printed.
Note: Use special version wilcox.exact with correction if sample has ties.

Mann-Whitney - Wilcoxon-test in R

A <- c(1, 3, 4, 5, 7)
B <- c(6, 8, 9, 10, 11)

wilcox.test(A, B) # use optional argument `paired = TRUE` for paired data.


    Wilcoxon rank sum exact test

data:  A and B
W = 1, p-value = 0.01587
alternative hypothesis: true location shift is not equal to 0

Mann-Whitney - Wilcoxon-test with tie correction

applied if the rank differences contain duplicated values

A <- c(1, 3, 4, 5, 7)
B <- c(6, 8, 9, 10, 11)

library("exactRankTests")
wilcox.exact(A, B, paired=TRUE)


    Exact Wilcoxon signed rank test

data:  A and B
V = 0, p-value = 0.0625
alternative hypothesis: true mu is not equal to 0

Comparing two groups directly: A permutation test

Group 1: 10 observations from Gamma(2.3, 1)
Group 2: 10 observations from Gamma(1.0, 1)

Permutation test: the algoritm

Key Idea

Calculate a statistical parameter, e.g. difference of means \(\Delta\).
Simulate the null hypothesis: no group effect
- randomly permute the group labels
- null distribution of mean differences.

The algorithm (by example of a mean difference)

Compute observed mean difference (\(\Delta_{obs} = \bar{x}_1 - \bar{x}_2\)).
Randomly shuffle all 20 values and reassign to two groups of 10.
Compute new mean difference \(\Delta_{sim}\).
Repeat 999 times (or more) → get 999 permuted differences.
Compare observed difference to the distribution of permuted differences.
Estimate p-value as proportion of permuted differences ≥ observed.

\(\rightarrow\) Can be used to arbitrary statistical parameters and experimental designs.

Permutation test in practice

Where does \(\Delta_{obs}\) appear within the ordered series of simulated values \(\Delta_{i, sim}\)?

Let \(\Delta_{obs}\) be \(1.5\) in our example, then \(\Rightarrow\) \(p= 0.01\).

Power Analysis

Determining the power of statistical tests

How many replicates will I need?

Depends on:
- the relative effect size \(\frac{\mathrm{effect}}{\mathrm{standard ~ deviation}}\)
\[\delta=\frac{(\bar{x}_1-\bar{x}_2)}{s}\]
- the sample size \(n\)
- and the pre-defined significance level \(\alpha\)
- and the applied method
The smaller \(\alpha\), \(n\) and \(\delta\), the bigger the type II (\(\beta\)) error.
The \(\beta\)-error is the probability to overlook effects despite of their existence.
Power (\(1-\beta\)) is the probability that a test is significant if an effect exists.

Power analysis

Formula for minimum sample size in the one-sample case:

\[ n = \bigg(\frac{z_\alpha + z_{1-\beta}}{\delta}\bigg)^2 \]

\(z\): the quantiles (qnorm) of the standard normal distribution for \(\alpha\) and for \(1-\beta\)
\(\delta=\Delta / s\): relative effect size.

Example

Two-tailed test with \(\alpha=0.025\) and \(\beta=0.2\)

\(\rightarrow\) \(z_\alpha = 1.96\), \(z_\beta=0.84\), then:

\[ n= (1.96 + 0.84)^2 \cdot 1/\delta^2 \approx 8 /\delta^2 \]

\(\delta = 1.0\cdot \sigma\) \(\qquad\Rightarrow\) n > 8

\(\delta = 0.5\cdot \sigma\) \(\qquad\Rightarrow\) n > 32

Power of the t-test

The power of a t-test, or the minimum sample size, can be calculated with: power.t.test():

power.t.test(n=5, delta=0.5, sig.level=0.05)


     Two-sample t test power calculation 

              n = 5
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.1038399
    alternative = two.sided

NOTE: n is number in *each* group

\(\rightarrow\) power = 0.10

For \(n=5\) an existing effect of \(0.5\sigma\) is only detected in 1 out of 10 cases.
For a power of 80% at \(n=5\) we need an effect size of at least \(2\sigma\):

power.t.test(n=5, power=0.8, sig.level=0.05)

For a weak effect of \(0.5\sigma\) we need a sample size of \(n\ge64\) in each group:

power.t.test(delta=0.5,power=0.8,sig.level=0.05)

\(\Rightarrow\) we need either a large sample size or a strong effect.

Simulated power of a t-test

# population parameters
n      <- 10
xmean1 <- 50; xmean2 <- 55
xsd1   <- xsd2 <- 10
alpha  <- 0.05

nn <- 1000   # number of test runs in the simulation
a <- b <- 0  # initialize counters
for (i in 1:nn) {
  # create random numbers
  x1 <- rnorm(n, xmean1, xsd1)
  x2 <- rnorm(n, xmean2, xsd2)
  # results of the t-test
  p <- t.test(x1,x2,var.equal = TRUE)$p.value 
  if (p < alpha) {
     a <- a+1
   } else {
     b <- b+1
  }
}
print(paste("a=", a, ", b=", b, ", a/n=", a/nn, ", b/n=", b/nn))

Test for distributions

Testing for distributions

Nominal variables

\(\chi^2\)-test
Fisher’s exact test

Ordinal variables

Cramér-von-Mises-Test
\(\rightarrow\) more powerful than \(\chi^2\) or KS-test

Metric scales

Kolmogorov-Smirnov-Test (KS-test)
Shapiro-Wilks-Test (for normal distribution)
Graphical checks

Contingency tables for nominal variables

used for nominal (i.e. categorical or qualitative) data
examples: eye and hair color, medical treatment and the number of cured/not cured
important: use absolute mesurements (true numbers!), not percentages or other calculated data (e.g. not something like biomass per area)

Example: Occurence of Daphnia (water flea) in a lake:

Clone	Upper layer	Deep layer
A	50	87
B	37	78
C	72	45

food algae in the deep water, that was poor of oxygen
genetically evolved clones with higher haemoglobin content can dive into deep water

Calculation of the \(\chi^2\)-test

Observed frequencies \(O_{ij}\)

	Clone A	Clone B	Clone C	Sum \(s_i\)
Upper layer	50	37	72	159
Lower layer	87	78	45	210
Sum \(s_j\)	137	115	117	\(n=369\)

Expected frequencies \(E_{ij} = s_i \cdot s_j / n\) (balanced distribution = null hypothesis)

	Clone A	Clone B	Clone C	Sum \(s_i\)
Upper layer	59.0	49.6	50.4	159
Lower layer	78.0	65.4	66.6	210
Sum \(s_j\)	137	115	117	\(n=369\)

Test statistic \(\hat{\chi}^2 = \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)
Compare with critical \(\chi^2\) from table with \((n_{row} - 1) \cdot (n_{col} - 1)\) df.

The \(\chi^2\)-test in R

Organize data in a matrix with 3 rows (for the clones) and 2 columns (for the depths):

x <- matrix(c(50, 37, 72, 87, 78, 45), ncol=2)
x

     [,1] [,2]
[1,]   50   87
[2,]   37   78
[3,]   72   45

chisq.test(x)


    Pearson's Chi-squared test

data:  x
X-squared = 24.255, df = 2, p-value = 5.408e-06

Note: The results are only reliable if all observed frequencies are \(\geq 5\).
For smaller samples, use Fisher’s exact test.

Fisher’s exact test

x <- matrix(c(50, 37, 72, 87, 78, 45), ncol=2)
x

     [,1] [,2]
[1,]   50   87
[2,]   37   78
[3,]   72   45

fisher.test(x)


    Fisher's Exact Test for Count Data

data:  x
p-value = 5.807e-06
alternative hypothesis: two.sided

\(\rightarrow\) significant correlation between the clones and vertical distribution in the lake.

Favorite numbers of HSE students

Numbers from 1..9, \(n=34\)
\(H_0\): equal probability of all numbers \(1/9\) (discrete uniform distribution)
\(H_A\): some numbers are favored \(\rightarrow\) departure from discrete uniform

Chisquare test

obsfreq <- c(1, 1, 6, 2, 2, 5, 8, 6, 3)
chisq.test(obsfreq)


    Chi-squared test for given probabilities

data:  obsfreq
X-squared = 13.647, df = 8, p-value = 0.09144

chisq.test(obsfreq, simulate.p.value=TRUE, B=1000)


    Chi-squared test for given probabilities with simulated p-value (based
    on 1000 replicates)

data:  obsfreq
X-squared = 13.647, df = NA, p-value = 0.0959

one-sample \(\chi^2\)-test. It tests for equality of frequency in all classes.
The simulation-based version of the test (with 1000 replicates) is slightly more precise than the standard \(\chi^2\)-test, but both are not significant.

Cramér-von-Mises-Test

\[ T = n \omega^2 = \frac{1}{12n} + \sum_{i=1}^n \left[ \frac{2i-1}{2n}-F(x_i) \right]^2 \]

Cramér-von-Mises-Test in R

library(dgof)
obsfreq <- c(1, 1, 6, 2, 2, 5, 8, 6, 3)

## CvM-test needs individual values, not class frequencies
x <- rep(1:length(obsfreq), obsfreq)
x

 [1] 1 2 3 3 3 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9

## create a cumulative function with equal probability of all cases
cdf <- stepfun(1:9, cumsum(c(0, rep(1/9, 9))))
cdf <- ecdf(1:9)

## perform the test
cvm.test(x, cdf)


    Cramer-von Mises - W2

data:  x
W2 = 0.51658, p-value = 0.03665
alternative hypothesis: Two.sided

The Cramér-von-Mises-test works with the original, unbinned values
Use of cumulative distribution function respects order of classes \(\rightarrow\) more powerful, than \(\chi^2\)-test.

Testing for normal distribution

Why do we want this?

Sometimes we want to know whether a data set belongs to a specific type of distribution. Though this sounds easy, it appears quite difficult for theoretical reasons:

statistical tests check for deviations from the null hypothesis
but here we want to test the opposite, if \(H_0\) is true

This is in fact impossible, because “not significant” means only that a potential effect is either not existent or just too small to be detected. On the opposite, “significantly different” includes a certain probability of false positives.

However, most statistical tests do not require perfect agreement with a certain distribution:

t-test and ANOVA assume normality of residuals
due to the CLT, the distribution of sums and mean values converges to normal

Testing or checking?

Philosophical problem: We want to keep the \(H_0\)!

Equality cannot be tested
Therefore: better to say “checking normality”.

Think first

Does normal distribution “makes sense” for the data?
Are the data metric (continuous)?
What is the data generating process? \(\rightarrow\) Contextual understanding!

Inherent non-normality

Some types of data, such as count data (e.g., number of occurrences) and binary data (e.g., yes/no), are inherently non-normal.

Binary data: use methods for Binomial distribution with raw data instead of percentages
Count data: use methods designed for Poisson distribution

Shapiro-Wilks-W-Test ?

\(\rightarrow\) Aim: tests if a sample conforms to a normal distribution

x <- runif(20)
shapiro.test(x)


    Shapiro-Wilk normality test

data:  x
W = 0.93541, p-value = 0.1961

\(\rightarrow\) the \(p\)-value is greater than 0.05, so we would keep \(H_0\) and conclude that nothing speaks against acceptance of the normal

Interpration of the Shapiro-Wilks-test needs to be done with care:

for small \(n\), the test is not sensitive enough
far large \(n\), it is over-sensitive
using Shapiro-Wilks to check normality for t-test and ANOVA is not anymore recommended

Alternative: use graphical methods

histogram, boxplot, QQ-plot (=quantile-quantile plot)
see also: Box-Cox method

Graphical checks of normality

\(x\): theoretical quantiles where a value should be found if the distribution is normal
\(y\): normalized and ordered measured values (\(z\)-scores)
scaled in the unit of standard deviations
normal distribution if the points follow a straight line

Checking distributions for descriptive purposes

In some disciplines, such as hydrology, it is occasionally necessary to know which distribution type best describes a dataset. This is especially critical for Extreme Value Analysis (e.g., the 100-year flood), as the Central Limit Theorem (CLT) does not apply here.

Procedure

Visual Inspection (Focus on the Tails)
- Histogram (first impression)
- Q-Q Plots (detailed goodness-of-fit)
Formal Tests (exclusion principle)
- Tests: Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling
- Reject the distribution type if the p-value is significant (p<0.05).
Model Selection (Choosing the best model)
- Choose the model that provides the best fit while also demonstrating Parsimony
- Criteria like AIC/BIC will be covered later
- Consider the physical plausibility of the parameters.

Transformations

allows to apply methods designed for normally distributed data to non-normal cases
very common in the in the past, still sometimes useful
modern methods (e.g. generalized linear models, GLM) can handle certain distributions directly, such as the Binomial, Gamma, or Poisson distribution.

Transformations for right-skewed data

\(x'=\log(x)\)
\(x'=\log(x + a)\)
\(x'=(x+a)^c\) (\(a\) between 0.5 and 1)
\(x'=1/x\) (“very powerful”, i.e. to extreme in most cases)
\(x'=a - 1/\sqrt{x}\) (to make scale more convenient)
\(x'=1/\sqrt{x}\) (compromise between \(\ln\) and \(1/x\))
\(x'=a+bx^c\) (very general, includes powers and roots)

Transformations II

Transformations for count data

\(x'=\sqrt{3/8+x}\) (counts: 1, 2, 3 \(\rightarrow\) 0.61, 1.17, 1.54, 1.84, )
\(x'=\log(\log(x))\) for giant numbers

\(\rightarrow\) consider a GLM with family Poisson or quasi-Poisson instead

Ratios and percentages values between (0, 1)

\(x'=\arcsin \sqrt{x/n}\)
\(x'=\arcsin \sqrt{\frac{x+3/8}{n+3/4}}\)

\(\rightarrow\) consider a GLM with family binomial instead

How to find the best transformation?

Example: biovolumes of diatom algae cells (species Nitzschia acicularis).

dat <- read.csv("prk_nit.csv")

Nit85 <- dat$biovol[dat$group == "nit85"]
Nit90 <- dat$biovol[dat$group == "nit90"]

hist(Nit85, xlab="Biovolume (mm^3)")
hist(Nit90, xlab="Biovolume (mm^3)")

Right skewed distribution.
Biovolume calculations used for foodweb studies and algae bloom prediction.

Box-Cox method

Estimate optimal transformation from the class of powers and logarithms

\[ y' = \begin{cases} \frac{y^\lambda -1}{\lambda} & | & \lambda \ne 0\\ \log(y) & | & \lambda =0 \end{cases} \] In some practical cases, one can simply use \(y^\lambda\) instead of \(\frac{y^\lambda -1}{\lambda}\).

library(MASS)

boxcox(Nit90 ~ 1)

Argument of boxcox is a “model formula” or the outcome of a linear model (lm)
Most basic form is the “null model” without explanation variables (~ 1).
More about model formulas, see the ANOVA chapter.

Box Cox-method: Results

Interpretation

dotted vertical lines and horizontal 95%-line show the confidence limits for possible transformations.
Numbers are apporoximate \(\rightarrow\) round it to one decimal.
Here we can use either a log transformation (\(\lambda=0\)) or a power of \(\approx 0.5\).

Obtain the numerical value directly:

bc <- boxcox(Nit90 ~ 1)

str(bc)

List of 2
 $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
 $ y: num [1:100] -237 -233 -230 -226 -223 ...

bc$x[bc$y == max(bc$y)]

[1] 0.1818182

Test of pooled samples with different mean

boxcox(biovol ~ group, data = dat)

To test for joint distribution of all groups at once, specify explanatory variables at the right hand side of the model formula: biovol ~ group
Optimal transformation for both samples together is log.

Dependency and correlation

Correlation

Frequencies of nominal variables

\(\chi^2\)-test
Fisher’s exact test

⇒ dependence between plant society and soil type

(see before)

Ordinal variables

Spearman-Correlation

\(\rightarrow\) rank numbers

Metric scales

Pearson-correlation
Spearman-correlation

Variance and Covariance

Variance

measures variation of a single variable

\[ s^2_x = \frac{\text{sum of squares}}{\text{degrees of freedom}}=\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} \]

Covariance

measures how two variables change together

\[ q_{x,y} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]

Correlation: scaled to \((-1, +1)\)

\[ r_{x,y} = \frac{q_{x,y}}{s_x \cdot s_y} \]

Correlation coefficient after Pearson

the usual correlation coefficient that we all know
tests for linear dependence

\[ r_p=\frac{\sum{(x_i-\bar{x}) (y_i-\bar{y})}} {\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} \]

Or:

\[ r_p=\frac {\sum xy - \sum y \sum y / n} {\sqrt{(\sum x^2-(\sum x)^2/n)(\sum y^2-(\sum y)^2/n)}} \]

Range of values: \(-1 \le r_p \le +1\)

\(0\)	no interdependence
\(+1 \,\text{or}\,-1\)	strictly positive resp. negative dependence
\(0 < \|r_p\| < 1\)	positive resp. negative dependence

Which size of correlation indicates dependency?

\(r=0.4, \quad p=0.0039\)

\(r=0.85, \quad p=0.07\)

Significant correlation?

\[ \hat{t}_{\alpha/2;n-2} =\frac{|r_p|\sqrt{n-2}}{\sqrt{1-r^2_p}} \]

\(t=0.829 \cdot \sqrt{1000-2}/\sqrt{1-0.829^2}=46.86, df=998\)

Quick test: critical values for \(r_p\)

\(n\)	d.f.	\(t\)	\(r_{crit}\)
3	1	12.706	0.997
5	3	3.182	0.878
10	8	2.306	0.633
20	18	2.101	0.445
50	48	2.011	0.280
100	98	1.984	0.197
1000	998	1.962	0.062

Rank-correlation according to Spearman

measures monotonous (and not necessarily linear) dependence
estimation from rank differences:

\[ r_s=1-\frac{6 \sum d^2_i}{n(n^2-1)} \]

or, alternatively: Pearson-correlation of ranked data (necessary in case of ties).
Test: for \(n < 10\) \(\rightarrow\) table of critical values

for \(10 \leq n\) \(\rightarrow\) \(t\)-distribution

\[ \hat{t}_{1-\frac{\alpha}{2};n-2} =\frac{|r_s|}{\sqrt{1-r^2_S}} \sqrt{n-2} \]

Example

\(x\)	\(y\)	\(R_x\)	\(R_y\)	\(d\)	\(d^2\)
1	2.7	1	1	0	0
2	7.4	2	2	0	0
3	20.1	3	3	0	0
4	500.0	4	5	-1	1
5	148.4	5	4	+1	1
					2

\[ r_s=1-\frac{6 \cdot 2}{5\cdot (25-1)}=1-\frac{12}{120}=0.9 \]

For comparison: \(r_p=0.58\)

Application of Spearman’s-\(r_s\)

Advantages

distribution free (does not require normal distribution),
detects any dependence,
not much affected by outliers.

Disadvantages:

certain information loss due to ranking,
no information about type of dependency,
no direct relationship to coefficient of determination.

Conclusion: \(r_s\) is nevertheless highly recommended!

Correlation coefficients in R

Pearson’s product-moment correlation coefficient
Spearman’s rank correlation coefficient

x <- c(1, 2, 3, 5, 7,  9)
y <- c(3, 2, 5, 6, 8, 11)
cor.test(x, y, method="pearson")


    Pearson's product-moment correlation

data:  x and y
t = 7.969, df = 4, p-value = 0.001344
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7439930 0.9968284
sample estimates:
      cor 
0.9699203

If linearity or normality of residuals is doubtful, use a rank correlation

cor.test(x, y, method="spearman")


    Spearman's rank correlation rho

data:  x and y
S = 2, p-value = 0.01667
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.9428571

Problematic cases

Outlook: More than two independent variables

Multiple correlation

Example: Chl-a=\(f(x_1, x_2, x_3, \dots)\), where \(x_i\) = biomass of the \(i\)th phytoplankton species.
multiple correlation coefficient
partial correlation coefficient
attractive method \(\leftrightarrow\) but difficult in practice:
- “independent” variables may correlate with each other (multi-collinearity)
  \(\Rightarrow\) bias of the multiple \(r\).
- non-linearities are even more difficult to handle than in the two-sample case.

Recommendation:

Use multivariate methods (NMDS, PCA, …) for a first overview,
apply multiple regression with care and use process knowledge.

References

Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and a’s in psychological research. Theory & Psychology, 14(3), 295–327. https://doi.org/10.1177/0959354304043638

Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57(1), 173–181. https://doi.org/10.1348/000711004849222