Applied Statistics – A Practical Course
2025-11-20
Definition
\(\rightarrow\) https://en.wikipedia.org/wiki/Probability_distribution
Characteristics
Probability distributions are one of the core concepts in statistics, and many statistics courses start with coin tossing or dice rolls. We begin with a small classroom experiment: students of an international course were asked for their favorite number from 1 to 9.
| number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| frequency | 0 | 1 | 5 | 5 | 6 | 4 | 12 | 3 | 3 |
The resulting distribution can be visualized as a bar plot:
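A minimal sketch to reproduce the bar plot in R (variable names chosen for illustration):

number <- 1:9
frequency <- c(0, 1, 5, 5, 6, 4, 12, 3, 3)
barplot(frequency, names.arg = number, xlab = "number", ylab = "frequency")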
Instead of real-world experiments, we can also use simulated random numbers.
Purpose
\(\rightarrow\) Simulation: important tool for statistical method development and understanding!
runif: random numbers from the uniform distribution

runif(10)
[1] 0.5114425 0.1479786 0.4758420 0.4116042 0.8703237 0.1085770 0.1970274
[8] 0.8633014 0.9360339 0.5007363

The probability density function of the uniform distribution is:
\[ f(x) = \begin{cases} \frac{1}{x_{max}-x_{min}} & \text{for } x \in [x_{min},x_{max}] \\ 0 & \text{otherwise} \end{cases} \]
The cdf is the integral of the density function:
\[ F(x) = \int_{-\infty}^{x} f(t) \, dt \] The total area (total probability) is \(1.0\):
\[ \int_{-\infty}^{+\infty} f(x) \, dx = 1 \]
For the uniform distribution, it is:
\[ F(x) = \begin{cases} 0 & \text{for } x < x_{min} \\ \frac{x-x_{min}}{x_{max}-x_{min}} & \text{for } x \in [x_{min},x_{max}] \\ 1 & \text{for } x > x_{max} \end{cases} \]
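In R, this cdf is available as punif(); a minimal sketch, evaluated for example for \(\mathbf{U}(40, 60)\):

punif(c(40, 50, 60), min = 40, max = 60)  # 0.0 0.5 1.0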

The quantile function is the inverse of the cumulative distribution function.

Figures: cumulative distribution function and quantile function
Example: In which range can we find 95% of a uniform distribution \(\mathbf{U}(40,60)\)?
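The question can be answered with the quantile function qunif() in R:

qunif(c(0.025, 0.975), min = 40, max = 60)  # 40.5 59.5

So 95% of the values lie between 40.5 and 59.5.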
The density function of the normal distribution is mathematically beautiful.
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \mathrm{e}^{-\frac{(x-\mu)^2}{2 \sigma^2}} \]
C.F. Gauss, Gauss curve and formula on a German DM banknote from 1991–2001 (Wikipedia, CC0)
Sums of a large number \(n\) of independent and identically distributed random values are approximately normally distributed, regardless of the type of the original distribution (central limit theorem, CLT).
\(\rightarrow\) row sums are approximately normally distributed (see the sketch below)
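A small simulation illustrates this; a minimal sketch, using the uniform distribution as one possible original distribution:

set.seed(123)
m <- matrix(runif(100 * 25), nrow = 100)  # 100 rows of 25 uniform random numbers
hist(rowSums(m))  # the row sums form an approximately bell-shaped histogram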

| Quantile \(z\) | 1 | 1.64 | 1.96 | 2.0 | 2.33 | 2.57 | 3 | \(\mu \pm z\cdot \sigma\) |
|---|---|---|---|---|---|---|---|---|
| one-sided | 0.84 | 0.95 | 0.975 | 0.977 | 0.99 | 0.995 | 0.9986 | \(1-\alpha/2\) |
| two-sided | 0.68 | 0.90 | 0.95 | 0.955 | 0.98 | 0.99 | 0.997 | \(1-\alpha\) |
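The table can be verified with pnorm() in R:

z <- c(1, 1.64, 1.96, 2, 2.33, 2.57, 3)
pnorm(z)          # one-sided: P(X < mu + z * sigma)
2 * pnorm(z) - 1  # two-sided: P(mu - z * sigma < X < mu + z * sigma)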
Any normal distribution can be shifted and scaled to form a standard normal distribution with \(\mu=0, \sigma=1\).
Normal distribution:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \mathrm{e}^{-\frac{(x-\mu)^2}{2 \sigma^2}} \]

Applying the \(z\)-transformation

\[ z = \frac{x-\mu}{\sigma} \]

leads to the standard normal distribution:

\[ f(z) = \frac{1}{\sqrt{2\pi}} \, \mathrm{e}^{-\frac{1}{2}z^2} \]
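A quick numerical illustration; a minimal sketch with simulated data:

x <- rnorm(1000, mean = 50, sd = 10)  # simulated normal sample
z <- (x - mean(x)) / sd(x)            # z-transformation, cf. scale(x)
c(mean(z), sd(z))                     # approximately 0 and 1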
The 97.5% quantile of the t-distribution approaches the normal quantile \(1.96\) with increasing degrees of freedom (df):

| df | 1 | 4 | 9 | 19 | 29 | 99 | 999 |
|---|---|---|---|---|---|---|---|
| \(t_{0.975}\) | 12.71 | 2.78 | 2.26 | 2.09 | 2.05 | 1.98 | 1.96 |
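These quantiles can be reproduced in R:

qt(0.975, df = c(1, 4, 9, 19, 29, 99, 999))  # approaches qnorm(0.975) = 1.96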
Examples: discharge of rivers, nutrient concentrations, algae biomass in lakes
Quasi-Poisson if \(\mu \neq \sigma^2\)
– depends only on \(\lambda\), or equivalently on the number of counted units (\(k\))
Typical error of cell counting: 95% confidence interval
| counts | 2 | 3 | 5 | 10 | 50 | 100 | 200 | 400 | 1000 |
|---|---|---|---|---|---|---|---|---|---|
| lower | 0 | 1 | 2 | 5 | 37 | 81 | 173 | 362 | 939 |
| upper | 7 | 9 | 12 | 18 | 66 | 122 | 230 | 441 | 1064 |
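These values appear consistent with exact Poisson confidence intervals, which R's poisson.test() computes; a minimal check for 400 counted cells:

poisson.test(400)$conf.int  # approximately (362, 441), cf. the table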
Recall the central limit theorem: sums (and therefore means) of a large number \(n\) of independent, identically distributed random values are approximately normally distributed, regardless of the original distribution.
Reason: Methods like the t-test or ANOVA are based on mean values, whose distribution is approximately normal due to the CLT.
Standard error
\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \]
Estimation of the 95% confidence interval:
\[ CI_{95\%} = \bigg(\bar{x} - z_{0.975} \cdot \frac{s}{\sqrt{n}}, \bar{x} + z_{0.975} \cdot \frac{s}{\sqrt{n}}\bigg) \]
with \(z_{1-\alpha/2} = z_{0.975} = 1.96\).
\(\rightarrow\) \(2\sigma\) rule
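In R, the z-based interval can be computed directly; a minimal sketch with simulated data:

x <- rnorm(100, mean = 50, sd = 10)  # simulated sample
m <- mean(x)
se <- sd(x) / sqrt(length(x))        # standard error
m + qnorm(c(0.025, 0.975)) * se      # 95% confidence interval of the mean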
prediction interval: characterizes the distribution of the data from the parameters of the sample (e.g. mean, standard deviation). It estimates the range where a single, future observation will likely fall.
– the standard deviation \(s_x\) measures the variability of the original data
– it can reconstruct the original distribution if its type is known (e.g. normal, lognormal)

confidence interval: characterizes the precision of a statistical parameter, based on its standard error.
– using \(\bar{x}\) and \(s_{\bar{x}}\), we estimate the interval where we find \(\mu\) with a certain probability
– less dependent on the original distribution of the data, due to the CLT

Both intervals are compared in the sketch below.
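A minimal sketch under a normality assumption:

set.seed(42)
x <- rnorm(50, mean = 50, sd = 10)
m <- mean(x); s <- sd(x); n <- length(x)
m + qt(c(0.025, 0.975), n - 1) * s * sqrt(1 + 1/n)  # prediction interval for a new observation
m + qt(c(0.025, 0.975), n - 1) * s / sqrt(n)        # confidence interval for the mean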
For small samples, the normal quantile \(z\) is replaced by the quantile of the t-distribution with \(n-1\) degrees of freedom:
\[ CI_{95\%} = \bigg(\bar{x} - t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}, \bar{x} + t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}\bigg) \]
The quantiles of the t-distribution are available via the qt() function in R. Example with \(\mu=50\) and \(\sigma=10\):
set.seed(123)
n <- 10
x <- rnorm(n, 50, 10)
m <- mean(x); s <- sd(x)
se <- s/sqrt(n)
# lower and upper confidence limits
m + qt(c(0.025, 0.975), n-1) * se
[1] 43.92330 57.56922
\(\rightarrow\) the true mean (\(\mu\)=50) is in the interval CI = (43.9, 57.6).
\(\Rightarrow\) It can be wrong to exclude values only because they are “too big” or “too small”.
\(\rightarrow\) Try to find the reason why values are extreme!
\(4 \sigma\)-rule
library(car)
x <- c(rnorm(20), 12) # the 21st value (=12) is an outlier
outlierTest(lm(x ~ 1)) # x ~ 1 is the null model

   rstudent unadjusted p-value Bonferroni p
21 11.66351         4.1822e-10   8.7826e-09
\(\rightarrow\) The 21st value is identified as an outlier.
Alternative to outlier tests
– use robust methods, e.g. rlm instead of lm
Boxplots mark values as extreme when they lie outside the whiskers, i.e. more than 1.5 times the width of the interquartile box away from the box limits. Such values are sometimes called "outliers".
I prefer the term "extreme value", because they can be regular observations from a skewed or heavy-tailed distribution.
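Robust regression is provided by rlm() from the MASS package; a minimal sketch with the same data pattern as above:

library(MASS)
x <- c(rnorm(20), 12)  # the 21st value (=12) is again extreme
rlm(x ~ 1)             # robust estimate of the mean, barely affected by the extreme value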
Discharge data of the Elbe River in Dresden in \(\mathrm m^3 s^{-1}\), data source: Bundesanstalt für Gewässerkunde (BFG), see terms and conditions.