Applied Statistics – A Practical Course
2025-09-16
Definition
\(\rightarrow\) https://en.wikipedia.org/wiki/Probability_distribution
Characteristics
Probability distributions are one of the core concepts in statistics, and many statistics courses start with coin tossing or dice rolls. We begin with a small classroom experiment.
In a classroom experiment, students of an international course were asked for their favorite number from 1 to 9.
number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
frequency | 0 | 1 | 5 | 5 | 6 | 4 | 12 | 3 | 3 |
The resulting distribution can be displayed as a barplot.
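A minimal sketch in R, using the frequencies from the table above:

freq <- c(0, 1, 5, 5, 6, 4, 12, 3, 3)   # observed frequencies
barplot(freq, names.arg = 1:9, xlab = "number", ylab = "frequency")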
Instead of real-world experiments, we can also use simulated random numbers.
Purpose
\(\rightarrow\) Simulation: important tool for statistical method development and understanding!
The function runif (r: random, unif: uniform) generates uniformly distributed random numbers:

runif(10)
 [1] 0.1134328 0.5626773 0.5967718 0.6539964 0.9805977 0.1334710 0.9474710
 [8] 0.8635867 0.7851835 0.3819017
\[ f(x) = \begin{cases} \frac{1}{x_{max}-x_{min}} & \text{for } x \in [x_{min},x_{max}] \\ 0 & \text{otherwise} \end{cases} \]
The cdf is the integral of the density function:
\[ F(x) = \int_{-\infty}^{x} f(t)\, dt \]
The total area (total probability) is \(1.0\):
\[ \int_{-\infty}^{+\infty} f(x)\, dx = 1 \]
For the uniform distribution, it is:
\[ F(x) = \begin{cases} 0 & \text{for } x < x_{min} \\ \frac{x-x_{min}}{x_{max}-x_{min}} & \text{for } x \in [x_{min},x_{max}] \\ 1 & \text{for } x > x_{max} \end{cases} \]
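In R, the cdf of the uniform distribution is available as punif. A small check, with \(x_{min}=0\) and \(x_{max}=4\) chosen arbitrarily:

punif(2, min = 0, max = 4)  # (2 - 0) / (4 - 0) = 0.5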
Quantile function
… the inverse of the cumulative distribution function.
(Figure: the cumulative distribution function and its inverse, the quantile function)
Example: In which range can we find 95% of a uniform distribution \(\mathbf{U}(40,60)\)?
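One way to answer this in R is the quantile function qunif:

qunif(c(0.025, 0.975), min = 40, max = 60)
[1] 40.5 59.5

So 95% of the values lie between 40.5 and 59.5.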
The density function of the normal distribution is mathematically beautiful.
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \mathrm{e}^{-\frac{(x-\mu)^2}{2 \sigma^2}} \]
C.F. Gauss, Gauss curve and formula on a German DM banknote from 1991–2001 (Wikipedia, CC0)
Sums of a large number \(n\) of independent and identically distributed random values are approximately normally distributed, regardless of the type of the original distribution.
\(\rightarrow\) row sums are approximately normally distributed
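This can be illustrated with a small simulation; the matrix dimensions and the seed are arbitrary choices:

set.seed(42)
x <- matrix(runif(100 * 25), nrow = 100)  # 100 rows with 25 uniform random numbers each
hist(rowSums(x))                          # the row sums look approximately bell-shaped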
Quantile | 1 | 1.64 | 1.96 | 2.0 | 2.33 | 2.57 | 3 | \(\mu \pm z\cdot \sigma\) |
---|---|---|---|---|---|---|---|---|
one-sided | 0.841 | 0.95 | 0.975 | 0.977 | 0.99 | 0.995 | 0.9986 | \(1-\alpha/2\) |
two-sided | 0.68 | 0.90 | 0.95 | 0.955 | 0.98 | 0.99 | 0.997 | \(1-\alpha\) |
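The tabulated probabilities can be reproduced with pnorm, the cdf of the normal distribution:

z <- c(1, 1.64, 1.96, 2, 2.33, 2.57, 3)
pnorm(z)          # one-sided probabilities
2 * pnorm(z) - 1  # two-sided probabilities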
Any normal distribution can be shifted and scaled to form a standard normal distribution with \(\mu=0, \sigma=1\).
Normal distribution
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \mathrm{e}^{-\frac{(x-\mu)^2}{2 \sigma^2}} \]
Standardization with
\[
z = \frac{x-\mu}{\sigma}
\]
transforms it into the
Standard normal distribution
\[ f(x) = \frac{1}{\sqrt{2\pi}} \, \mathrm{e}^{-\frac{1}{2}x^2} \]
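In R, the density is available as dnorm; a quick sketch of the standard normal curve:

curve(dnorm(x), from = -4, to = 4, ylab = "f(x)")  # standard normal density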
With increasing degrees of freedom (df), the quantiles \(t_{0.975}\) of the t-distribution approach the normal quantile 1.96:

df | 1 | 4 | 9 | 19 | 29 | 99 | 999 |
---|---|---|---|---|---|---|---|
\(t_{0.975}\) | 12.71 | 2.78 | 2.26 | 2.09 | 2.05 | 1.98 | 1.96 |
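The table can be reproduced with the quantile function qt:

round(qt(0.975, df = c(1, 4, 9, 19, 29, 99, 999)), 2)
[1] 12.71  2.78  2.26  2.09  2.05  1.98  1.96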
Examples: discharge of rivers, nutrient concentrations, algae biomass in lakes
Quasi-Poisson if \(\mu \neq \sigma^2\)
– depends only on \(\lambda\), i.e. on the number of counted units (\(k\))
Typical error of cell counting: 95% confidence interval
counts | 2 | 3 | 5 | 10 | 50 | 100 | 200 | 400 | 1000 |
---|---|---|---|---|---|---|---|---|---|
lower | 0 | 1 | 2 | 5 | 37 | 81 | 173 | 362 | 939 |
upper | 7 | 9 | 12 | 18 | 66 | 122 | 230 | 441 | 1064 |
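Such exact Poisson confidence limits can be computed with poisson.test; a sketch for two of the tabulated counts:

poisson.test(10)$conf.int   # approximately (5, 18), cf. the table
poisson.test(100)$conf.int  # approximately (81, 122)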
Sometimes we want to know whether a data set belongs to a specific type of distribution. Though this sounds easy, it appears quite difficult for theoretical reasons:
Strictly proving that data follow a specific distribution is impossible, because “not significant” means only that a potential deviation is either non-existent or too small to be detected. Conversely, “significantly different” includes a certain probability of false positives.
However, most statistical tests do not require perfect agreement with a certain distribution:
\(\rightarrow\) Aim: test whether a sample conforms to a normal distribution
\(\rightarrow\) if the \(p\)-value is greater than 0.05, we keep \(H_0\) and conclude that nothing speaks against the assumption of normality
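A minimal sketch with simulated data (seed and sample size are arbitrary):

set.seed(123)
x <- rnorm(100)  # data drawn from a true normal distribution
shapiro.test(x)  # a large p-value is expected here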
Interpretation of the Shapiro-Wilk test needs to be done with care:
Recommendation: Use graphical checks. Don’t blindly trust the Shapiro-Wilk test!
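A common graphical check is the normal quantile-quantile plot, here with simulated data:

x <- rnorm(100)  # simulated example data
qqnorm(x)        # points should follow a straight line for normal data
qqline(x)        # reference line through the quartiles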
Transformations for right-skewed data
Transformations for count data
\(\rightarrow\) consider a GLM with family Poisson or quasi-Poisson instead
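A minimal sketch of such a model; the data frame and variable names are made up for illustration:

# hypothetical count data for two groups
d <- data.frame(group  = rep(c("A", "B"), each = 20),
                counts = rpois(40, lambda = rep(c(3, 6), each = 20)))
summary(glm(counts ~ group, family = quasipoisson, data = d))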
Ratios and percentages
\(\rightarrow\) consider a GLM with family binomial instead
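Again a minimal sketch with made-up data, using the “successes and failures” form of the response:

# hypothetical data: x successes out of n = 30 trials per sample
d <- data.frame(group = rep(c("A", "B"), each = 10),
                x = rbinom(20, size = 30, prob = rep(c(0.3, 0.6), each = 10)),
                n = 30)
summary(glm(cbind(x, n - x) ~ group, family = binomial, data = d))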
Example: Spearman correlation
Data set
Ranks
Two ways of calculation
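Spearman’s rank correlation is the Pearson correlation of the ranks, so both ways can be sketched in R with made-up numbers:

x <- c(1, 3, 5, 7, 20)          # made-up data with one extreme value
y <- c(2, 9, 4, 10, 6)
cor(x, y, method = "spearman")  # direct calculation
cor(rank(x), rank(y))           # Pearson correlation of the ranks, same result (0.5)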
Sums of a large number \(n\) of independent and identically distributed random values are approximately normally distributed, regardless of the type of the original distribution.
Reason: Methods like t-test or ANOVA are based on mean values.
Standard error
\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \]
Estimation of the 95% confidence interval:
\[ CI_{95\%} = \bigg(\bar{x} - z_{0.975} \cdot \frac{s}{\sqrt{n}}, \bar{x} + z_{0.975} \cdot \frac{s}{\sqrt{n}}\bigg) \]
with \(z_{1-\alpha/2} = z_{0.975} = 1.96\).
\(\rightarrow\) \(2\sigma\) rule
- sample interval: characterizes the distribution of the data from the parameters of the sample (e.g. mean, standard deviation)
  - the standard deviation \(s_x\) measures the variability of the original data
  - can be used to reconstruct the original distribution if its type is known (e.g. normal, lognormal)
- confidence interval: characterizes the precision of a statistical parameter, based on its standard error
  - using \(\bar{x}\) and \(s_{\bar{x}}\), we estimate the interval where we find \(\mu\) with a certain probability
  - less dependent on the original distribution of the data, due to the CLT
\[ CI_{95\%} = \bigg(\bar{x} - t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}, \bar{x} + t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}\bigg) \]
The quantile \(t_{0.975, n-1}\) can be computed with the qt() function in R. Example with \(\mu=50\) and \(\sigma=10\):
set.seed(123)          # make the random sample reproducible
n <- 10
x <- rnorm(n, 50, 10)  # sample of n = 10 from N(50, 10)
m <- mean(x); s <- sd(x)
se <- s/sqrt(n)        # standard error of the mean
# lower and upper confidence limits
m + qt(c(0.025, 0.975), n-1) * se
[1] 43.92330 57.56922
\(\rightarrow\) the true mean (\(\mu\)=50) is in the interval CI = (43.9, 57.6).
\(\Rightarrow\) It can be wrong to exclude values only because they are “too big” or “too small”.
\(\rightarrow\) Try to find the reason why values are extreme!
\(4 \sigma\)-rule
library(car)
x <- c(rnorm(20), 12)  # the 21st value (= 12) is an outlier
outlierTest(lm(x ~ 1)) # x ~ 1 is the intercept-only null model
rstudent unadjusted p-value Bonferroni p
21 11.66351 4.1822e-10 8.7826e-09
\(\rightarrow\) The 21st value is identified as an outlier.
Alternative to outlier tests
Use robust methods, e.g. rlm from package MASS instead of lm.
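A minimal sketch, reusing the x from the outlier example above:

library(MASS)
rlm(x ~ 1)  # robust estimate of the mean; the outlier gets a low weight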
Values are drawn outside the whiskers if they are more than 1.5 times the width of the interquartile box (the IQR) away from the box limits. Such points are sometimes called “outliers”. I prefer the term “extreme values”, because they can be regular observations from a skewed or heavy-tailed distribution.
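For illustration, a boxplot of the vector x from the outlier example above:

boxplot(x)  # the value 12 should appear as a single point beyond the upper whisker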
Discharge data of the Elbe River in Dresden in \(\mathrm m^3 s^{-1}\), data source: Bundesanstalt für Gewässerkunde (BFG), see terms and conditions.