05-Classical Tests

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-09-16

Hypotheses, errors and the p-value

Statistical test


A statistical hypothesis test is a method of statistical inference.

  • Commonly, two samples are compared, or a sample is compared against properties from an idealized model.
  • A hypothesis \(H_a\) about the statistical relationship between the two data sets is compared to an idealized null hypothesis \(H_0\) that proposes no relationship between the two data sets.
  • The comparison is considered statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis according to a threshold probability – the significance level.

adapted from: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing

Effect size and significance


For differences between mean values, the relative effect size is:

\[ \delta = \frac{\mu_1-\mu_2}{\sigma}=\frac{\Delta}{\sigma} \]

with:

  • mean values of two populations \(\mu_1, \mu_2\)
  • absolute effect size \(\Delta = \mu_1 - \mu_2\)
  • relative effect size \(\delta\) (also called Cohen’s d)
  • significance means that an observed effect is unlikely to be the result of pure random variation.
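For illustration, the relative effect size (Cohen's d) can be estimated directly in R; a minimal sketch with hypothetical sample values, using the pooled standard deviation:

## hypothetical samples
x1 <- c(5.3, 6.0, 7.1, 6.4, 5.7)
x2 <- c(6.8, 7.2, 8.1, 7.5, 6.9)
n1 <- length(x1); n2 <- length(x2)
## pooled standard deviation
s_pool <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
## relative effect size (Cohen's d)
(mean(x1) - mean(x2)) / s_pool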

Null hypothesis and alternative hypothesis


\(H_0\) null hypothesis: two populations are not different with respect to a certain property.

  • Assumption: the observed effect occurred purely at random; the true effect is zero.

\(H_a\) alternative hypothesis (experimental hypothesis): existence of a certain effect.

  • An alternative hypothesis is never completely true or “proven”.
  • Acceptance of \(H_a\) means only that \(H_0\) is unlikely.

“Not significant” means either no effect or sample size too small!


Note the different meanings of significance (\(H_0\) unlikely) and relevance (effect large enough to play a role in practice).

The p-value


The interpretation of the p-value was often confused in the past, even in statistics textbooks, so it is good to refer to a clear definition:

The p-value is defined as the probability of obtaining a result equal to or ‘more extreme’ than what was actually observed, when the null hypothesis is true.


Alpha and beta errors

Reality            Decision of the test   correct?   probability
\(H_0\) = true     significant            no         \(\alpha\)-error
\(H_0\) = false    not significant        no         \(\beta\)-error
\(H_0\) = true     not significant        yes        \(1-\alpha\)
\(H_0\) = false    significant            yes        \(1-\beta\) (power)

1. \(H_0\) falsely rejected (error of the first kind or \(\alpha\)-error)

  • we claim an effect that does not exist, e.g. for a drug without effect

2. \(H_0\) falsely retained (error of the second kind or \(\beta\)-error)

  • typical in small studies, where the sample size is too small to detect an existing effect

Use in practice

  • common convention in environmental sciences: \(\alpha=0.05\), must be set beforehand
  • \(\beta=f(\alpha, \text{effectsize}, \text{sample size}, \text{kind of test})\), should be \(\le 0.2\)

Significance and relevance


Significance is not the only important criterion. Focus also on effect size and relevance!

  • Statistical significance means that the null hypothesis \(H_0\) is unlikely in a statistical sense.

  • Practical relevance (sometimes called “practical significance”) means that the effect size is large enough to play a role in practice.

This means that whether an effect is relevant depends on its size and the field of application.

Consider, for example, a vaccination. If a vaccine had a significant effect in a clinical test but protected only 10 out of 1000 people, one would not consider this effect relevant and would not produce the vaccine.

On the other hand, even small effects can be relevant. If a toxic substance caused cancer in 1 out of 1000 people, we would consider this relevant. Detecting such an effect as significant would require an epidemiological study with a large number of people, but as the effect is highly relevant, it is worth the effort.

Take home messages

  • A p-value measures the probability that a purely random effect would be equally or more extreme than an observed effect if the null hypothesis is true.

  • Significant means the results are unlikely if there were no real effect.

  • Not significant doesn’t mean “no effect”.

  • Non-significant results suggest the need for further research, e.g.:

    • increase sample size
    • increase experimental effect
    • reduce experimental error
    • consider a more powerful statistical procedure
  • Don’t focus on p-values alone. Never forget to also report sample size, effect size and the relevance of your results.

  • With large datasets:

    • statistically significant results can easily be obtained even for very small and practically irrelevant effects.
    • \(\rightarrow\) effect size and relevance become more important than p-values.

The p-value remains an important tool in statistics, but misuse can lead to misinterpretation.

Differences between mean values

One sample t-Test


  • tests if a sample is from a population with a given mean value \(\mu\)
  • based on checking whether the population mean \(\mu\) lies in the confidence interval of \(\bar{x}\)
  1. Let’s assume a sample with \(n=10, \bar{x}=5.5, s=1\) and \(\mu=5\).
  2. Estimate the 95% confidence interval of \(\bar{x}\):

\[ CI = \bar{x} \pm t_{1-\alpha/2, n-1} \cdot s_{\bar{x}} \] with \[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad \text{(standard error)} \]

Different ways of calculation are shown on the next slides.

Remember: standard deviation and standard error

Visualization of a one-sample t-test. Left: original distribution of the data measured by standard deviation, right: distribution of mean values, measured by its standard error.

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \qquad \text{(standard error)} \]

  • standard error < standard deviation
  • measures precision of the mean value
  • CLT (central limit theorem): the distribution of mean values approaches normality

The test is based on the distribution of the means, not distribution of original data.
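A small simulation (assumed parameters) illustrates this: means of repeated samples scatter with the standard error, not with the standard deviation of the raw data:

set.seed(42)                # for reproducibility
n <- 10
## 1000 sample means from a normal population with mean 5 and sd 1
means <- replicate(1000, mean(rnorm(n, mean = 5, sd = 1)))
sd(means)                   # close to the standard error
1 / sqrt(n)                 # s/sqrt(n) = 0.316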

Method 1: Is \(\mu\) in the confidence interval?


  1. Sample: \(n=10, \bar{x}=5.5, s=1\) and \(\mu=5\)

  2. Let \(\alpha = 0.05\), we get a two-sided 95% confidence interval with:

\[\bar{x} \pm t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}\]

5.5 + c(-1, 1) * qt(0.975, 10-1) * 1/sqrt(10)
[1] 4.784643 6.215357


  3. Check whether \(\mu=5.0\) is in this interval.

  4. Yes, it is inside \(\Rightarrow\) the difference is not significant.

Method 2: Comparison with a tabulated t-value

  1. Rearrange the equation of the confidence interval, to calculate an observed \(t_{obs}\)

\[ t_{obs} = |\bar{x}-\mu | \cdot \frac{1}{s_{\bar{x}}} = \frac{|\bar{x}-\mu |}{s} \cdot \sqrt{n} = \frac{|5.5 -5.0|}{1.0} \cdot \sqrt{10} \]

We can calculate this in R:

t <- abs(5.5 - 5.0) / 1.0 * sqrt(10)
t
[1] 1.581139
  2. Compare \(t_{obs}\) with a tabulated value
  • “Old style”: find critical t-value in a table for given \(\alpha\) and degrees of freedom (\(n-1\))
  • For \(\alpha=0.05\) and two-sided, this is: \(t_{1-\alpha/2, n-1} = 2.26\).

Comparison: \(1.58 < 2.26\) \(\Rightarrow\) no significant difference between \(\bar{x}\) and \(\mu\).
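The critical value does not require a printed table; it can be taken directly from qt() in R:

qt(0.975, df = 10 - 1)  # critical two-sided t-value for alpha = 0.05
[1] 2.262157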

Method 3: Calculation of the p-value from \(t_{obs}\)


  • use computerized probability function (pt) instead of table lookup
  • \(t = t_{obs}\) and the degrees of freedom (\(n-1\)):


2 * (1 - pt(t, df = 10 - 1)) # two-sided p-value: twice the upper-tail probability
[1] 0.1483047

This p-value (0.1483) is greater than \(0.05\), so we consider the difference as not significant.


FAQ: less than or greater than?


criterion        condition                            conclusion
p-value          \(\text{p-value} < \alpha\)          null hypothesis unlikely \(\rightarrow\) significant
test statistic   \(t_{obs} > t_{1-\alpha/2, n-1}\)    effect exceeds conf. interval \(\rightarrow\) significant

Method 4: Built-in t-test function in R


The same can be done much easier with the computer in R.

Let’s assume we have a sample with \(\bar{x}=5.5, s=1\):

## define sample
x <- c(5.5, 3.5, 5.4, 5.3, 6, 7.2, 5.4, 6.3, 4.5, 5.9)

## perform one-sample t-test
t.test(x, mu=5)

    One Sample t-test

data:  x
t = 1.5811, df = 9, p-value = 0.1483
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.784643 6.215357
sample estimates:
mean of x 
      5.5 

The test returns the observed t-value, the 95% confidence interval and the p-value.

An important difference is that this method needs the original data, while the other methods need only the mean, standard deviation and sample size.

Two sample t-test


The two-sample t-test compares two independent samples:

x1 <- c(5.3, 6.0, 7.1, 6.4, 5.7, 4.9, 5.0, 4.6, 5.7, 4.0, 4.5, 6.5)
x2 <- c(5.8, 7.1, 5.8, 7.0, 6.7, 7.7, 9.2, 6.0, 7.2, 7.8, 7.8, 5.7)
t.test(x1, x2)

    Welch Two Sample t-test

data:  x1 and x2
t = -3.7185, df = 21.611, p-value = 0.001224
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.3504462 -0.6662205
sample estimates:
mean of x mean of y 
 5.475000  6.983333 


  • \(\rightarrow\) both samples differ significantly (\(p < 0.05\))
  • Note: R did not perform the “ordinary” t-test but the Welch test (= heteroscedastic t-test), where the variances of the two samples do not need to be identical.

Hypothesis and formula of the two-sample t-test


\(H_0\) \(\mu_1 = \mu_2\)

\(H_a\) the two means are different

test criterion

\[ t_{obs} =\frac{|\bar{x}_1-\bar{x}_2|}{s_{tot}} \cdot \sqrt{\frac{n_1 n_2}{n_1+n_2}} \]

pooled standard deviation

\[ s_{tot} = \sqrt{{({n}_1 - 1)\cdot s_1^2 + ({n}_2 - 1)\cdot s_2^2 \over ({n}_1 + {n}_2 - 2)}} \]

assumptions: independence, equal variances, approximate normal distribution
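As an illustration, the test criterion can be computed by hand in R (x1 and x2 from the previous slide); with equal sample sizes it gives the same t-value as the Welch test:

n1 <- length(x1); n2 <- length(x2)
## pooled standard deviation
s_tot <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
## observed t-value
abs(mean(x1) - mean(x2)) / s_tot * sqrt(n1 * n2 / (n1 + n2))  # 3.7185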

The Welch test


Known as t-test for samples with unequal variance, works also for equal variance!


Test criterion:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2_{\bar{x}_1} + s^2_{\bar{x}_2}}} \]

Standard error of each sample:

\[ s_{\bar{x}_i} = \frac{s_i}{\sqrt{n_i}} \]

Corrected degrees of freedom:

\[ \text{df} = \frac{\left(\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}\right)^2}{\frac{s^4_1}{n^2_1(n_1-1)} + \frac{s^4_2}{n^2_2(n_2-1)}} \]
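For illustration, the corrected degrees of freedom can be reproduced step by step in R (x1 and x2 as in the example before):

v1 <- var(x1) / length(x1)  # squared standard error of sample 1
v2 <- var(x2) / length(x2)  # squared standard error of sample 2
(v1 + v2)^2 / (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))  # 21.611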

Welch test in R

t.test(x1, x2)

    Welch Two Sample t-test

data:  x1 and x2
t = -3.7185, df = 21.611, p-value = 0.001224
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.3504462 -0.6662205
sample estimates:
mean of x mean of y 
 5.475000  6.983333 

… is just the default method of the t.test-function.

Equality of variance: F-test

\(H_0\): \(\sigma_1^2 = \sigma_2^2\)

\(H_a\): variances unequal

Test criterion:

\[F = \frac{s_1^2}{s_2^2} \]

  • larger of the two variances in the numerator \((s^2_1 > s^2_2)\)
  • separate degrees of freedom (\(n-1\))

Example:

  • \(s_1=1\), \(s_2=2\), \(n_1=5\), \(n_2=10\), \(F=\frac{2^2}{1^2}=4\)
  • degrees of freedom: 9 (numerator) and 4 (denominator)

\(\Rightarrow\) \(F_{0.975;\,9, 4} = 8.9 > 4 \quad\rightarrow\) not significant
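In R, the critical value comes from qf(); for raw data, var.test(x, y) performs the complete F-test:

qf(0.975, df1 = 9, df2 = 4)  # critical F-value, about 8.9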

Homogeneity of variances with > 2 samples


Bartlett’s test (here with a third sample x3 in addition to x1 and x2):

bartlett.test(list(x1, x2, x3))

    Bartlett test of homogeneity of variances

data:  list(x1, x2, x3)
Bartlett's K-squared = 7.7136, df = 2, p-value = 0.02114


Fligner-Killeen test (recommended):

fligner.test(list(x1, x2, x3))

    Fligner-Killeen test of homogeneity of variances

data:  list(x1, x2, x3)
Fligner-Killeen:med chi-squared = 2.2486, df = 2, p-value = 0.3249

  • These tests are often used to check the assumptions of ANOVA.

Recommendation for two sample t-tests


Traditional procedure:

  1. Test for equal variances using the F-test: var.test(x, y)
  2. If variances are equal: t.test(x, y, var.equal=TRUE)
  3. otherwise, use t.test(x, y) (= Welch test)
  4. Check if both samples follow a normal distribution.


Modern recommendation (preferred):

  1. Don’t use pre-tests!
  2. Always use the Welch test: t.test(x, y)
  3. Check approximate normal distribution with box- or QQ-plots. Less important if \(n\) is large.

see Zimmerman (2004) or Wikipedia.

Paired t-Test

  • sometimes also called “t-test of dependent samples”

    • the term “dependent” can be misleading, “paired” is clearer
    • values within samples must still be independent
  • examples: left arm / right arm; before / after

  • is essentially a one-sample t-test of pairwise differences against \(\mu=0\)

  • reduces the influence of individual differences (“covariates”) by focusing on the change within each pair

x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, var.equal=TRUE)

    Two Sample t-test

data:  x1 and x2
t = -1.372, df = 8, p-value = 0.2073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.28924  1.08924
sample estimates:
mean of x mean of y 
      4.0       5.6 

p=0.20, not significant

x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, paired=TRUE)

    Paired t-test

data:  x1 and x2
t = -4, df = 4, p-value = 0.01613
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.710578 -0.489422
sample estimates:
mean difference 
           -1.6 

p=0.016, significant

It can be seen that the paired t-test has a greater discriminatory power in this case.

Mann-Whitney and Wilcoxon-test


  • Non-parametric tests:
    • No assumptions about shape and parameters of distribution, but
    • distributions should have a similar shape, otherwise the test may be misleading.
  • Based on Ranks: Tests compare the ranks of the data.
  • Use Mann-Whitney for independent samples, Wilcoxon for paired samples.


Basic principle: Count of so-called “inversions” of ranks, where samples overlap

  • Sample A: 1, 3, 4, 5, 7
  • Sample B: 6, 8, 9, 10, 11
  • Both samples ordered together: 1, 3, 4, 5, 6, 7, 8, 9, 10, 11
  • Inversions: only one value of B (the 6) precedes a value of A (the 7) \(\rightarrow\) \(U = 1\)

Mann-Whitney test procedure in practice


  1. Assign ranks \(R_A\) and \(R_B\) to both samples \(A\) and \(B\) with sample sizes \(m\) and \(n\).
  2. Calculate the number of inversions \(U\):

\[\begin{align*} U_A &= m \cdot n + \frac{m (m + 1)}{2} - \sum_{i=1}^m R_{A,i} \\ U_B &= m \cdot n + \frac{n (n + 1)}{2} - \sum_{i=1}^n R_{B,i} \\ U &= \min(U_A, U_B) \end{align*}\]

  • Critical values of \(U\) can be found in common statistics text books.
  • Not necessary in R, p-value directly printed.
  • Note: use the special version wilcox.exact (package exactRankTests) with tie correction if the sample has ties; a manual sketch of the \(U\) calculation follows below.
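For illustration, a sketch of the manual calculation in R (samples A and B as on the next slide):

A <- c(1, 3, 4, 5, 7)
B <- c(6, 8, 9, 10, 11)
m <- length(A); n <- length(B)
R <- rank(c(A, B))                        # joint ranks of both samples
U_A <- m * n + m * (m + 1) / 2 - sum(R[1:m])
U_B <- m * n + n * (n + 1) / 2 - sum(R[-(1:m)])
min(U_A, U_B)                             # U = 1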

Mann-Whitney - Wilcoxon-test in R


A <- c(1, 3, 4, 5, 7)
B <- c(6, 8, 9, 10, 11)

wilcox.test(A, B) # use optional argument `paired = TRUE` for paired data.

    Wilcoxon rank sum exact test

data:  A and B
W = 1, p-value = 0.01587
alternative hypothesis: true location shift is not equal to 0


Mann-Whitney - Wilcoxon-test with tie correction

  • applied if the rank differences contain duplicated values
A <- c(1, 3, 4, 5, 7)
B <- c(6, 8, 9, 10, 11)
library("exactRankTests")
wilcox.exact(A, B, paired=TRUE)

    Exact Wilcoxon signed rank test

data:  A and B
V = 0, p-value = 0.0625
alternative hypothesis: true mu is not equal to 0

Permutation methods

  • Basic principle: estimation of a test statistic \(\xi_{obs}\) from the sample
  • Resampling: simulate many \(\xi_{i, sim}\) from randomly permuted data sets (\(n = 999\) or more)
  • Where does \(\xi_{obs}\) appear within the ordered series of simulated values \(\xi_{i, sim}\)?

Let \(\xi_{obs}\) be \(4.5\) in our example, then \(\Rightarrow\) \(p= 0.97\).
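A minimal sketch of a two-sample permutation test in R (function name and sample values hypothetical, two-sided):

perm_test <- function(x, y, n_sim = 999) {
  xi_obs <- mean(x) - mean(y)             # observed test statistic
  pooled <- c(x, y)
  xi_sim <- replicate(n_sim, {
    z <- sample(pooled)                   # random permutation
    mean(z[seq_along(x)]) - mean(z[-seq_along(x)])
  })
  ## position of the observed value among the simulated ones (two-sided)
  (sum(abs(xi_sim) >= abs(xi_obs)) + 1) / (n_sim + 1)
}
perm_test(c(5.3, 6.0, 7.1, 6.4), c(6.8, 7.2, 8.1, 7.5))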

Power Analysis

Determining the power of statistical tests


How many replicates will I need?

  • Depends on:

    • the relative effect size \(\frac{\text{effect}}{\text{standard deviation}}\)

    \[\delta=\frac{(\bar{x}_1-\bar{x}_2)}{s}\]

    • the sample size \(n\)
    • and the pre-defined significance level \(\alpha\)
    • and the applied method
  • The smaller \(\alpha\), \(n\) and \(\delta\), the bigger the type II (\(\beta\)) error.

  • The \(\beta\)-error is the probability of overlooking an effect despite its existence.

  • Power (\(1-\beta\)) is the probability that a test is significant if an effect exists.

Power analysis


Formula for minimum sample size in the one-sample case:

\[ n = \bigg(\frac{z_{1-\alpha/2} + z_{1-\beta}}{\delta}\bigg)^2 \]

  • \(z\): the quantiles (qnorm) of the standard normal distribution for \(1-\alpha/2\) and for \(1-\beta\)
  • \(\delta=\Delta / s\): relative effect size.

Example

Two-sided test with \(\alpha=0.05\) and \(\beta=0.2\)

\(\rightarrow\) \(z_{1-\alpha/2} = 1.96\), \(z_{1-\beta}=0.84\), then:

\[ n= (1.96 + 0.84)^2 \cdot 1/\delta^2 \approx 8 /\delta^2 \]

\(\Delta = 1.0\,\sigma\), i.e. \(\delta=1\) \(\qquad\Rightarrow\) \(n \ge 8\)

\(\Delta = 0.5\,\sigma\), i.e. \(\delta=0.5\) \(\qquad\Rightarrow\) \(n \ge 32\)
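The formula translates directly into a small R function (a sketch; qnorm gives the required quantiles):

n_required <- function(delta, alpha = 0.05, beta = 0.2) {
  ((qnorm(1 - alpha / 2) + qnorm(1 - beta)) / delta)^2
}
n_required(1.0)   # about 8
n_required(0.5)   # about 32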

Power of the t-test

The power of a t-test, or the minimum sample size, can be calculated with power.t.test():

power.t.test(n=5, delta=0.5, sig.level=0.05)

     Two-sample t test power calculation 

              n = 5
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.1038399
    alternative = two.sided

NOTE: n is number in *each* group

\(\rightarrow\) power = 0.10

  • For \(n=5\) an existing effect of \(0.5\sigma\) is only detected in 1 out of 10 cases.
  • For a power of 80% at \(n=5\) we need an effect size of at least \(2\sigma\):
power.t.test(n=5, power=0.8, sig.level=0.05)

For a weak effect of \(0.5\sigma\) we need a sample size of \(n\ge64\) in each group:

power.t.test(delta=0.5,power=0.8,sig.level=0.05)

\(\Rightarrow\) we need either a large sample size or a strong effect.

Simulated power of a t-test

# population parameters
n      <- 10
xmean1 <- 50; xmean2 <- 55
xsd1   <- xsd2 <- 10
alpha  <- 0.05

nn <- 1000   # number of test runs in the simulation
a <- b <- 0  # initialize counters
for (i in 1:nn) {
  # create random numbers
  x1 <- rnorm(n, xmean1, xsd1)
  x2 <- rnorm(n, xmean2, xsd2)
  # results of the t-test
  p <- t.test(x1,x2,var.equal = TRUE)$p.value 
  if (p < alpha) {
     a <- a+1
   } else {
     b <- b+1
  }
}
print(paste("a=", a, ", b=", b, ", a/nn=", a/nn, ", b/nn=", b/nn))
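For comparison, the analytical power for the same parameters (relative effect size \(\delta = 5/10 = 0.5\)) should be close to the simulated rejection rate a/nn:

power.t.test(n = 10, delta = 5, sd = 10, sig.level = 0.05)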

Test for distributions

Testing for distributions

Nominal variables

  • \(\chi^2\)-test
  • Fisher’s exact test

Ordinal variables

  • Cramér-von-Mises-Test
  • \(\rightarrow\) more powerful than the \(\chi^2\)- or KS-test

Metric scales

  • Kolmogorov-Smirnov-Test (KS-test)
  • Shapiro-Wilk test (for normal distribution)
  • Graphical checks

Contingency tables for nominal variables

  • used for nominal (i.e. categorical or qualitative) data
  • examples: eye and hair color, medical treatment and the number of cured/not cured
  • important: use absolute counts (raw frequencies!), not percentages or other derived quantities (e.g. not something like biomass per area)

Example: occurrence of Daphnia (water flea) clones in a lake:

Clone   Upper layer   Deep layer
A       50            87
B       37            78
C       72            45

  • the food (algae) is found in the deep water, which is poor in oxygen
  • clones that evolved a higher haemoglobin content can dive into the deep water

Calculation of the \(\chi^2\)-test

  1. Observed frequencies \(O_{ij}\):

              Clone A   Clone B   Clone C   Sum \(s_i\)
Upper layer      50        37        72       159
Lower layer      87        78        45       210
Sum \(s_j\)     137       115       117       \(n=369\)

  2. Expected frequencies \(E_{ij} = s_i \cdot s_j / n\) (balanced distribution = null hypothesis):

              Clone A   Clone B   Clone C   Sum \(s_i\)
Upper layer     59.0      49.6      50.4      159
Lower layer     78.0      65.4      66.6      210
Sum \(s_j\)     137       115       117       \(n=369\)

  3. Test statistic \(\hat{\chi}^2 = \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\) (reproduced in R below)

  4. Compare with the critical \(\chi^2\) from a table with \((n_{row} - 1) \cdot (n_{col} - 1)\) degrees of freedom.
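The manual calculation can be reproduced in a few lines of R:

O <- rbind(c(50, 37, 72),
           c(87, 78, 45))                    # observed frequencies
E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected frequencies
sum((O - E)^2 / E)                           # chi-squared = 24.255, df = 2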

The \(\chi^2\)-test in R


Organize data in a matrix with 3 rows (for the clones) and 2 columns (for the depths):

x <- matrix(c(50, 37, 72, 87, 78, 45), ncol=2)
x
     [,1] [,2]
[1,]   50   87
[2,]   37   78
[3,]   72   45
chisq.test(x)

    Pearson's Chi-squared test

data:  x
X-squared = 24.255, df = 2, p-value = 5.408e-06
  • Note: The results are only reliable if all expected frequencies are \(\geq 5\).
  • For smaller samples, use Fisher’s exact test.

Fisher’s exact test


x <- matrix(c(50, 37, 72, 87, 78, 45), ncol=2)
x
     [,1] [,2]
[1,]   50   87
[2,]   37   78
[3,]   72   45
fisher.test(x)

    Fisher's Exact Test for Count Data

data:  x
p-value = 5.807e-06
alternative hypothesis: two.sided


\(\rightarrow\) significant dependence between clone and vertical distribution in the lake.

Favorite numbers of HSE students

  • Numbers from 1..9, \(n=34\)
  • \(H_0\): equal probability of all numbers \(1/9\) (discrete uniform distribution)
  • \(H_A\): some numbers are favored \(\rightarrow\) departure from discrete uniform

\(\chi^2\)-test


obsfreq <- c(1, 1, 6, 2, 2, 5, 8, 6, 3)
chisq.test(obsfreq)

    Chi-squared test for given probabilities

data:  obsfreq
X-squared = 13.647, df = 8, p-value = 0.09144
chisq.test(obsfreq, simulate.p.value=TRUE, B=1000)

    Chi-squared test for given probabilities with simulated p-value (based
    on 1000 replicates)

data:  obsfreq
X-squared = 13.647, df = NA, p-value = 0.0969


  • This is the one-sample \(\chi^2\)-test; it tests for equality of frequencies in all classes.
  • The simulation-based version (with 1000 replicates) does not rely on the large-sample approximation of the standard \(\chi^2\)-test; here, both are not significant.

Cramér-von-Mises-Test

\[ T = n \omega^2 = \frac{1}{12n} + \sum_{i=1}^n \left[ \frac{2i-1}{2n}-F(x_i) \right]^2 \]

Cramér-von-Mises-Test in R

library(dgof)
obsfreq <- c(1, 1, 6, 2, 2, 5, 8, 6, 3)

## CvM-test needs individual values, not class frequencies
x <- rep(1:length(obsfreq), obsfreq)
x
 [1] 1 2 3 3 3 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9


## create a cumulative distribution function with equal probability of all classes
cdf <- stepfun(1:9, cumsum(c(0, rep(1/9, 9))))
## equivalent shortcut:
cdf <- ecdf(1:9)

## perform the test
cvm.test(x, cdf)

    Cramer-von Mises - W2

data:  x
W2 = 0.51658, p-value = 0.03665
alternative hypothesis: Two.sided
  • The Cramér-von-Mises test works with the original, unbinned values.
  • Use of the cumulative distribution function respects the order of the classes \(\rightarrow\) more powerful than the \(\chi^2\)-test.

Testing for normal distribution

Testing or checking?


Philosophical problem: We want to keep the \(H_0\)!

  • Equality cannot be tested
  • Therefore: better to say “checking normality”.

Think first

  • Does a normal distribution “make sense” for the data?
  • Are the data metric (continuous)?
  • What is the data generating process? \(\rightarrow\) Contextual understanding!

Inherent non-normality

Some types of data, such as count data (e.g., number of occurrences) and binary data (e.g., yes/no), are inherently non-normal.

  • Binary data: use methods for Binomial distribution with raw data instead of percentages
  • Count data: use methods designed for Poisson distribution

Shapiro-Wilk W test?

\(\rightarrow\) Aim: test whether a sample conforms to a normal distribution

x <- rnorm(100)
shapiro.test(x)

    Shapiro-Wilk normality test

data:  x
W = 0.99064, p-value = 0.7165


\(\rightarrow\) the \(p\)-value is greater than 0.05, so we keep \(H_0\) and conclude that nothing speaks against the assumption of a normal distribution.


Interpretation of the Shapiro-Wilk test needs to be done with care:

  • for small \(n\), the test is not sensitive enough
  • for large \(n\), it is over-sensitive
  • using the Shapiro-Wilk test to check normality for the t-test and ANOVA is no longer recommended

Alternative: use graphical methods


  • histogram, boxplot, QQ-plot (=quantile-quantile plot)
  • see also: Box-Cox method

Graphical checks of normality


  • \(x\): theoretical quantiles where a value should be found if the distribution is normal
  • \(y\): normalized and ordered measured values (\(z\)-scores)
  • scaled in the unit of standard deviations
  • normal distribution if the points follow a straight line
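A minimal example with simulated data (parameters arbitrary):

x <- rnorm(50, mean = 10, sd = 2)  # simulated sample
qqnorm(x)                          # QQ-plot
qqline(x)                          # reference line through the quartiles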

Transformations

  • allow applying methods designed for normally distributed data to non-normal cases
  • very common in the past, still sometimes useful
  • modern methods (e.g. generalized linear models, GLM) can handle certain distributions directly, such as the binomial, Gamma, or Poisson distribution.

Transformations for right-skewed data

  • \(x'=\log(x)\)
  • \(x'=\log(x + a)\)
  • \(x'=(x+a)^c\) (\(a\) between 0.5 and 1)
  • \(x'=1/x\) (“very powerful”, i.e. too extreme in most cases)
  • \(x'=a - 1/\sqrt{x}\) (to make scale more convenient)
  • \(x'=1/\sqrt{x}\) (compromise between \(\ln\) and \(1/x\))
  • \(x'=a+bx^c\) (very general, includes powers and roots)

Transformations II

Transformations for count data

  • \(x'=\sqrt{3/8+x}\) (counts 0, 1, 2, 3 \(\rightarrow\) 0.61, 1.17, 1.54, 1.84)
  • \(x'=\log(\log(x))\) for giant numbers

\(\rightarrow\) consider a GLM with family Poisson or quasi-Poisson instead

Ratios and percentages: values between 0 and 1

  • \(x'=\arcsin \sqrt{x/n}\)
  • \(x'=\arcsin \sqrt{\frac{x+3/8}{n+3/4}}\)

\(\rightarrow\) consider a GLM with family binomial instead

How to find the best transformation?

  • Example: biovolumes of diatom algae cells (species Nitzschia acicularis).


dat <- read.csv("prk_nit.csv")

Nit85 <- dat$biovol[dat$group == "nit85"]
Nit90 <- dat$biovol[dat$group == "nit90"]

hist(Nit85, xlab="Biovolume (mm^3)")
hist(Nit90, xlab="Biovolume (mm^3)")

  • Right skewed distribution.
  • Biovolume calculations used for foodweb studies and algae bloom prediction.

Box-Cox method

\[ y' = \begin{cases} y^\lambda & \lambda \ne 0\\ \log(y) & \lambda =0 \end{cases} \]

  • Estimate the optimal transformation from the class of powers and logarithms



library(MASS)

boxcox(Nit90 ~ 1)

  • The argument of boxcox is a so-called “model formula” or the outcome of a linear model (lm).
  • The most basic form is the “null model” without explanatory variables (~ 1).
  • More about model formulas can be found in the ANOVA chapter.

Box-Cox method: Results


Interpretation


  • The dotted vertical lines and the horizontal 95%-line show the confidence limits for possible transformations.
  • The numbers are approximate \(\rightarrow\) round to one decimal.
  • Here we can use either a log transformation (\(\lambda=0\)) or a power of \(\approx 0.5\).

Obtain the numerical value directly:

bc <- boxcox(Nit90 ~ 1)

str(bc)
List of 2
 $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
 $ y: num [1:100] -237 -233 -230 -226 -223 ...
bc$x[bc$y == max(bc$y)]
[1] 0.1818182

Test of pooled samples with different mean

boxcox(biovol ~ group, data = dat)
  • To test the joint distribution of all groups at once, specify explanatory variables on the right-hand side of the model formula: biovol ~ group
  • Optimal transformation for both samples together is log.

Dependency and correlation

Correlation


Frequencies of nominal variables

  • \(\chi^2\)-test
  • Fisher’s exact test

⇒ dependence between plant society and soil type

(see before)

Ordinal variables

  • Spearman-Correlation

\(\rightarrow\) rank numbers

Metric scales

  • Pearson-correlation
  • Spearman-correlation

Variance and Covariance


Variance

  • measures variation of a single variable

\[ s^2_x = \frac{\text{sum of squares}}{\text{degrees of freedom}}=\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} \]

Covariance

  • measures how two variables change together

\[ q_{x,y} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]

Correlation: scaled to \([-1, +1]\)

\[ r_{x,y} = \frac{q_{x,y}}{s_x \cdot s_y} \]
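These definitions map directly onto var(), cov() and cor() in R; a quick check with arbitrary example vectors:

x <- c(1, 2, 3, 5, 7, 9)
y <- c(3, 2, 5, 6, 8, 11)
cov(x, y) / (sd(x) * sd(y))  # correlation from scaled covariance
cor(x, y)                    # identical result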

Correlation coefficient after Pearson


  • the usual correlation coefficient that we all know
  • tests for linear dependence

\[ r_p=\frac{\sum{(x_i-\bar{x}) (y_i-\bar{y})}} {\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} \]

Or:

\[ r_p=\frac {\sum xy - \sum x \sum y / n} {\sqrt{(\sum x^2-(\sum x)^2/n)(\sum y^2-(\sum y)^2/n)}} \]

Range of values: \(-1 \le r_p \le +1\)

\(0\): no interdependence
\(+1\) or \(-1\): strictly positive resp. negative dependence
\(0 < |r_p| < 1\): positive resp. negative dependence

Which size of correlation indicates dependency?

Two example data sets (figures): \(r=0.4, \; p=0.0039\) and \(r=0.85, \; p=0.07\). Whether a correlation is significant depends on both \(r\) and the sample size \(n\).

Significant correlation?

\[ \hat{t} =\frac{|r_p|\sqrt{n-2}}{\sqrt{1-r^2_p}} \qquad \text{compare with} \quad t_{1-\alpha/2;\,n-2} \]

Example: \(t=0.829 \cdot \sqrt{1000-2}/\sqrt{1-0.829^2}=46.86\), \(df=998\)
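The same calculation in R, including the resulting p-value:

r <- 0.829; n <- 1000
t_obs <- abs(r) * sqrt(n - 2) / sqrt(1 - r^2)
t_obs                        # 46.86
2 * pt(-t_obs, df = n - 2)   # two-sided p-value, practically zero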


Quick test: critical values for \(r_p\)

\(n\) d.f. \(t\) \(r_{crit}\)
3 1 12.706 0.997
5 3 3.182 0.878
10 8 2.306 0.633
20 18 2.101 0.445
50 48 2.011 0.280
100 98 1.984 0.197
1000 998 1.962 0.062
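The \(r_{crit}\) column follows from solving the \(t\)-formula for \(r_p\), i.e. \(r_{crit} = t/\sqrt{df + t^2}\); a short R sketch reproduces the table:

n  <- c(3, 5, 10, 20, 50, 100, 1000)
df <- n - 2
t  <- qt(0.975, df)            # critical t-values
round(t / sqrt(df + t^2), 3)   # critical correlation coefficients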

Rank-correlation according to Spearman


  • measures monotonic (and not necessarily linear) dependence
  • estimation from rank differences:

\[ r_s=1-\frac{6 \sum d^2_i}{n(n^2-1)} \]

  • or, alternatively: Pearson-correlation of ranked data (necessary in case of ties).
  • Test: for \(n < 10\) \(\rightarrow\) table of critical values; for \(n \geq 10\) \(\rightarrow\) \(t\)-distribution:

\[ \hat{t}_{1-\frac{\alpha}{2};n-2} =\frac{|r_s|}{\sqrt{1-r^2_S}} \sqrt{n-2} \]

Example


\(x\)   \(y\)     \(R_x\)   \(R_y\)   \(d\)   \(d^2\)
1     2.7       1       1       0     0
2     7.4       2       2       0     0
3     20.1      3       3       0     0
4     500.0     4       5       -1    1
5     148.4     5       4       +1    1
                            \(\sum d_i^2\):   2

\[ r_s=1-\frac{6 \cdot 2}{5\cdot (25-1)}=1-\frac{12}{120}=0.9 \]

For comparison: \(r_p=0.58\)

Application of Spearman’s-\(r_s\)


Advantages

  • distribution free (does not require normal distribution),
  • detects any monotonic dependence,
  • not much affected by outliers.

Disadvantages:

  • certain information loss due to ranking,
  • no information about type of dependency,
  • no direct relationship to coefficient of determination.

Conclusion: \(r_s\) is nevertheless highly recommended!

Correlation coefficients in R

  • Pearson’s product-moment correlation coefficient
  • Spearman’s rank correlation coefficient
x <- c(1, 2, 3, 5, 7,  9)
y <- c(3, 2, 5, 6, 8, 11)
cor.test(x, y, method="pearson")

    Pearson's product-moment correlation

data:  x and y
t = 7.969, df = 4, p-value = 0.001344
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7439930 0.9968284
sample estimates:
      cor 
0.9699203 

If linearity or normality of residuals is doubtful, use a rank correlation:

cor.test(x, y, method="spearman")

    Spearman's rank correlation rho

data:  x and y
S = 2, p-value = 0.01667
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.9428571 

Problematic cases

Outlook: More than two independent variables


Multiple correlation

  • Example: Chl-a=\(f(x_1, x_2, x_3, \dots)\), where \(x_i\) = biomass of the \(i\)th phytoplankton species.
  • multiple correlation coefficient
  • partial correlation coefficient
  • attractive method \(\leftrightarrow\) but difficult in practice:
    • “independent” variables may correlate with each other (multi-collinearity)
      \(\Rightarrow\) bias of the multiple \(r\).
    • non-linearities are even more difficult to handle than in the two-sample case.

Recommendation:

  • Use multivariate methods (NMDS, PCA, …) for a first overview,
  • apply multiple regression with care and use process knowledge.

References


Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory & Psychology, 14(3), 295–327. https://doi.org/10.1177/0959354304043638
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57(1), 173–181. https://doi.org/10.1348/000711004849222