04-Distributions

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-11-11

Probability Distributions


Definition

  • a mathematical function
  • gives the probabilities of occurrence of the different possible outcomes of an experiment

\(\rightarrow\) https://en.wikipedia.org/wiki/Probability_distribution


Characteristics

  1. a specific shape (distribution type, a mathematical formula)
  2. can be described by its parameters (e.g. mean \(\mu\) and standard deviation \(\sigma\)).

Probability distributions are one of the core concepts in statistics, and many statistics courses start with coin tossing or dice rolls. We begin with a small classroom experiment.

What is your favorite number?

In a classroom experiment, students of an international course were asked for their favorite number from 1 to 9.

number      1  2  3  4  5  6  7   8  9
frequency   0  1  5  5  6  4  12  3  3

The resulting distribution is:

  • empirical: data from an experiment
  • discrete: only discrete numbers (1, 2, 3 …, 9) possible, no fractions

Computer simulations


Instead of real-world experiments, we can also use simulated random numbers.

  • advantage: we can simulate data from distributions with known properties
  • challenge: somewhat abstract

Purpose

  • get a feeling for randomness and what a sample following a given “theory” can look like
  • explore and test statistical methods and train understanding
  • a tool for experimental design
  • testing application and power of an analysis beforehand


\(\rightarrow\) Simulation: important tool for statistical method development and understanding!
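
A minimal sketch of this idea: simulate a sample from a distribution with known parameters and compare the sample statistics with the “true” values (numbers chosen here only for illustration).

set.seed(123)                        # make the simulation reproducible
x <- rnorm(100, mean = 50, sd = 10)  # sample with known "true" mu and sigma
mean(x)                              # sample mean, close to 50
sd(x)                                # sample standard deviation, close to 10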

Continuous uniform distribution \(\mathbf{U}(0, 1)\)

  • same probability of occurrence in a given interval
  • e.g. \([0, 1]\)
  • in R: runif (“random, uniform”)
runif(10)
 [1] 0.7543490 0.3742862 0.7793269 0.5062845 0.7014560 0.9209131 0.3238158
 [8] 0.5593060 0.1426554 0.5665669



  • binning: subdivide values into classes
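
As a small sketch, a histogram bins uniform random numbers into classes of equal width (class width chosen arbitrarily):

u <- runif(1000)
hist(u, breaks = seq(0, 1, by = 0.1))  # 10 classes of width 0.1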

Density function of \(\mathbf{U}(x_{min}, x_{max})\)

  • density \(f(X)\), sometimes abbreviated as “pdf” (probability density function):

\[ f(x) = \begin{cases} \frac{1}{x_{max}-x_{min}} & \text{for } x \in [x_{min},x_{max}] \\ 0 & \text{otherwise} \end{cases} \]

  • area under the curve (i.e. the integral) = 1.0
  • 100% of the events are between \(-\infty\) and \(+\infty\),
    and for \(\mathbf{U}(x_{min}, x_{max})\) within the interval \([x_{min}, x_{max}]\)
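
The density can be evaluated with dunif; a quick sketch, also checking that the area under the curve is one:

dunif(0.5, min = 0, max = 1)  # density inside [0, 1]: 1
dunif(1.5, min = 0, max = 1)  # density outside the interval: 0
integrate(dunif, 0, 1)$value  # area under the curve: 1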

Cumulative distribution function of \(\mathbf{U}(x_{min}, x_{max})\)


The cdf is the integral of the density function:

\[ F(x) =\int_{-\infty}^{x} f(t)\, dt \] The total area (total probability) is \(1.0\):

\[ \int_{-\infty}^{+\infty} f(x)\, dx = 1 \]

For the uniform distribution, it is:

\[ F(x) = \begin{cases} 0 & \text{for } x < x_{min} \\ \frac{x-x_{min}}{x_{max}-x_{min}} & \text{for } x \in [x_{min},x_{max}] \\ 1 & \text{for } x > x_{max} \end{cases} \]
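
In R, the cumulative distribution function of the uniform is punif; a small sketch:

punif(c(0, 0.25, 0.5, 1), min = 0, max = 1)  # 0.00 0.25 0.50 1.00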

Quantile function


… the inverse of the cumulative distribution function.

(Figures: cumulative distribution function and quantile function of the uniform distribution)

Example: In which range can we find 95% of a uniform distribution \(\mathbf{U}(40,60)\)?
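
The quantile function (qunif in R) answers this directly; a minimal sketch:

qunif(c(0.025, 0.975), min = 40, max = 60)  # central 95% between 40.5 and 59.5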

Summary: Uniform distribution


The normal distribution

The normal distribution \(\mathbf{N}(\mu, \sigma)\)

  • of high theoretical importance due to the central limit theorem (CLT)
  • results from adding a large number of random values of the same order of magnitude

The density function of the normal distribution is mathematically beautiful.

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \mathrm{e}^{-\frac{(x-\mu)^2}{2 \sigma^2}} \]

C.F. Gauss, Gauss curve and formula on a German DM banknote from 1991–2001 (Wikipedia, CC0)

The central limit theorem (CLT)


Sums of a large number \(n\) of independent and identically distributed random values are approximately normally distributed, independent of the type of the original distribution.


A Simulation experiment


  1. generate a matrix with 100 rows and 25 columns of uniformly distributed random numbers
  2. compute the row sums
par(mfrow=c(2, 1), las=1)
set.seed(42)
x  <- matrix(runif(25 * 100), ncol = 25)

# View(x) # uncomment this to show the matrix

x_sums <- rowSums(x)
hist(x)
hist(x_sums)


\(\rightarrow\) row sums are approximately normally distributed

Random numbers and density function

Density and quantiles of the standard normal

  • in theory, 50% of the values are below and 50% above the mean value
  • approximately 95% are within \(\mu \pm 2 \sigma\) (more precisely, \(\pm 1.96\sigma\))


Cumulative distribution function – Quantile function

Quantile \(z\)   1      1.64   1.96   2.0    2.33   2.57   3        \(\mu \pm z\cdot \sigma\)
one-sided        0.84   0.95   0.975  0.977  0.99   0.995  0.9986   \(1-\alpha\)
two-sided        0.68   0.90   0.95   0.955  0.98   0.99   0.997    \(1-\alpha/2\)
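
The tabulated probabilities can be reproduced with pnorm; a small sketch (rounding chosen for readability):

z <- c(1, 1.64, 1.96, 2, 2.33, 2.57, 3)
round(pnorm(z), 4)           # one-sided probabilities
round(2 * pnorm(z) - 1, 4)   # two-sided: probability within mu +/- z * sigma
qnorm(0.975)                 # 1.96, the quantile for the two-sided 95% case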

Standard normal, scaling and shifting

  • \(\mu\) is the shift parameter that moves the whole bell shaped curve along the \(x\) axis
  • \(\sigma\) is the scale parameter to stretch or compress in the direction of \(x\)

Standardization (\(z\)-transformation)


Any normal distribution can be shifted and scaled to form a standard normal with \(\mu=0, \sigma=1\).


Normal distribution

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \mathrm{e}^{-\frac{(x-\mu)^2}{2 \sigma^2}} \]





\[ z = \frac{x-\mu}{\sigma} \]

Standard normal distribution

\[ f(x) = \frac{1}{\sqrt{2\pi}} \, \mathrm{e}^{-\frac{1}{2}x^2} \]
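
A minimal sketch of the \(z\)-transformation in R (scale(x) gives the same result as the explicit formula):

x <- rnorm(10, mean = 50, sd = 10)
z <- (x - mean(x)) / sd(x)   # standardization by hand
round(mean(z), 10)           # 0 (up to rounding error)
sd(z)                        # 1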

t-Distribution \(\mathbf{t}(x, df)\)

  • additional parameter “degrees of freedom” (df)
  • used for confidence intervals and statistical tests
  • converges to the normal distribution for \(df \rightarrow \infty\)

Dependency of the t-value on the number of df

df   1       4      9      19     29     99     999
t    12.71   2.78   2.26   2.09   2.05   1.98   1.96
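
The tabulated values appear to be the two-sided 95% quantiles (i.e. the 0.975 quantiles) of the t-distribution; they can be reproduced with qt:

round(qt(0.975, df = c(1, 4, 9, 19, 29, 99, 999)), 2)
# 12.71  2.78  2.26  2.09  2.05  1.98  1.96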

Logarithmic normal distribution (lognormal)

  • many processes in nature do not follow a normal distribution
  • limited by zero on the left side
  • large extreme values on the right side

Examples: discharge of rivers, nutrient concentrations, algae biomass in lakes

Parent distribution of the lognormal

  • taking the log of values from a lognormal distribution \(\rightarrow\) normal parent distribution
  • the lognormal distribution is described by the parameters of the log-transformed data, \(\bar{x}_L\) and \(s_L\)
  • the antilog of \(\bar{x}_L\) is the geometric mean
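
A small sketch with simulated data (parameters chosen arbitrarily): the logs of lognormal values are approximately normal, and the antilog of their mean is the geometric mean.

set.seed(1)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.5)  # lognormal sample
hist(log(x))       # parent distribution of the logs: approximately normal
exp(mean(log(x)))  # antilog of the mean of the logs = geometric mean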

Binomial distribution

  • number of successful trials out of \(n\) total trials with success probability \(p\).
  • e.g. how many times do we get a “6” (success probability \(1/6\)) in 3 trials? (see the sketch below)
  • medicine, toxicology, comparison of percent numbers
  • similar, but without replacement: hypergeometric distribution in lottery
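
The dice question above can be answered with dbinom; a minimal sketch:

dbinom(0:3, size = 3, prob = 1/6)       # P(0, 1, 2, 3 sixes in 3 trials)
sum(dbinom(0:3, size = 3, prob = 1/6))  # the probabilities sum to 1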

Poisson distribution

  • distribution of rare events, a discrete distribution
  • mean and variance are equal (\(\mu = \sigma^2\)) and form the single parameter “lambda” (\(\lambda\))
  • Examples: bacteria counting on a grid, waiting queues, failure models
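
A quick simulation (lambda chosen arbitrarily) illustrates that mean and variance of a Poisson sample are approximately equal:

set.seed(42)
x <- rpois(1000, lambda = 4)  # e.g. simulated counts per grid field
mean(x)                       # close to lambda = 4
var(x)                        # close to lambda as well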

Quasi-Poisson if \(\mu \neq \sigma^2\)

  • If \(s^2 > \bar{x}\): overdispersion
  • if \(s^2 < \bar{x}\): underdispersion

Confidence interval

– depends only on \(\lambda\), i.e. on the number of counted units (\(k\))

Typical error of cell counting: 95% confidence interval

counts   2   3   5   10   50   100   200   400   1000
lower    0   1   2    5   37    81   173   362    939
upper    7   9  12   18   66   122   230   441   1064
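
Such intervals can be obtained, for example, with the exact Poisson test in base R (a sketch; the table above may have been computed with this or a similar method):

poisson.test(100)$conf.int  # approx. 81.4 to 121.6
poisson.test(5)$conf.int    # approx. 1.6 to 11.7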

Testing for distribution


Sometimes we want to know whether a data set belongs to a specific type of distribution. Though this sounds easy, it appears quite difficult for theoretical reasons:

  • statistical tests check for deviations from the null hypothesis
  • but here we want to show the opposite, namely that \(H_0\) is true

This is in fact impossible, because “not significant” means only that a potential effect is either non-existent or just too small to be detected. Conversely, “significantly different” always includes a certain probability of false positives.

However, most statistical tests do not require perfect agreement with a certain distribution:

  • t-test and ANOVA assume normality of residuals
  • due to the CLT, the distribution of sums and mean values converges to normal

Shapiro-Wilk W test?

\(\rightarrow\) Aim: tests if a sample conforms to a normal distribution

x <- rnorm(100)
shapiro.test(x)

    Shapiro-Wilk normality test

data:  x
W = 0.99064, p-value = 0.7165


\(\rightarrow\) the \(p\)-value is greater than 0.05, so we keep \(H_0\) and conclude that nothing speaks against the assumption of a normal distribution


Interpretation of the Shapiro-Wilk test needs to be done with care:

  • for small \(n\), the test is not sensitive enough
  • for large \(n\) it is over-sensitive
  • using the Shapiro-Wilk test to check normality before a t-test or ANOVA is no longer recommended

Graphical examination of normality

  • \(x\): theoretical quantiles where a value should be found if the distribution is normal
  • \(y\): normalized and ordered measured values (\(z\)-scores)
  • scaled in the unit of standard deviations
  • normal distribution if the points follow a straight line
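
A minimal sketch of a normal Q-Q plot in base R:

set.seed(1)
x <- rnorm(50)
qqnorm(x)  # ordered sample values against theoretical normal quantiles
qqline(x)  # reference line: points close to the line indicate normality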

Recommendation: Use graphical checks. Don’t rely blindly on the Shapiro-Wilk test!

Checking distributions for descriptive purposes

In some disciplines, such as hydrology, it is occasionally necessary to know which distribution type best describes a dataset. This is especially critical for Extreme Value Analysis (e.g., the 100-year flood), as the Central Limit Theorem (CLT) does not apply here.

Procedure

  1. Visual Inspection (Focus on the Tails)
    • Histogram (first impression)
    • Q-Q Plots (detailed goodness-of-fit)
  2. Formal Tests (exclusion principle)
    • Tests: Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling
    • Reject the distribution type if the p-value is significant (p < 0.05); see the sketch after this list.
  3. Model Selection (Choosing the best model)
    • Choose the model that provides the best fit while also demonstrating parsimony
    • Criteria like AIC/BIC will be covered later
    • Consider the physical plausibility of the parameters.
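
As an illustration of step 2, a one-sample Kolmogorov-Smirnov test in base R, used here to reject a candidate distribution (a sketch with simulated data):

set.seed(1)
x <- rlnorm(100)  # clearly right-skewed data
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
# a small p-value speaks against the normal distribution;
# note: estimating the parameters from the same data makes the
# p-value only approximate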

Transformation

  • makes it possible to apply methods designed for normally distributed data
  • modern methods can handle specific distributions directly, such as binomial, gamma, or Poisson

Transformations for right-skewed data

  • \(x'=\log(x)\)
  • \(x'=\log(x + a)\)
  • \(x'=(x+a)^c\) (\(a\) between 0.5 and 1)
  • \(x'=1/x\) (“very powerful”, i.e. too extreme in most cases)
  • \(x'=a - 1/\sqrt{x}\) (to make scale more convenient)
  • \(x'=1/\sqrt{x}\) (compromise between \(\ln\) and \(1/x\))
  • \(x'=a+bx^c\) (very general, includes powers and roots)
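
A small sketch applying two of these transformations to right-skewed example data containing a zero (numbers are purely illustrative):

x <- c(0, 1, 3, 10, 120)  # right-skewed data including a zero
log(x + 1)                # x' = log(x + a) with a = 1
(x + 0.5)^0.5             # x' = (x + a)^c with a = 0.5, c = 0.5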

Transformations II

Transformations for count data

  • \(x'=\sqrt{3/8+x}\) (counts 0, 1, 2, 3 \(\rightarrow\) 0.61, 1.17, 1.54, 1.84)
  • \(x'=\lg(x+3/8)\)
  • \(x'=\log(\log(x))\) for giant numbers

\(\rightarrow\) consider a GLM with family Poisson or quasi-Poisson instead

Ratios and percentages

  • \(x'=\arcsin \sqrt{x/n}\)
  • \(x'=\arcsin \sqrt{\frac{x+3/8}{n+3/4}}\)

\(\rightarrow\) consider a GLM with family binomial instead

Box-Cox transformation

Estimate optimal transformation from the class of powers and logarithms

\[ y' = y^\lambda \]

  • \(\lambda=0\): use logarithmic transformation
  • boxcox requires a “model formula” or the outcome of a linear model (lm)
library(MASS)
dat <- read.csv("https://tpetzoldt.github.io/datasets/data/prk_nit.csv")
Nit90 <- dat$biovol[dat$group == "nit90"]
boxcox(Nit90 ~ 1)

Box-Cox transformation II

The dotted vertical lines and the horizontal 95%-line show the confidence limits for possible transformations. Here we see that either a logarithmic transformation (\(\lambda=0\)) or a power of approximately 0.5 is suitable.

It is also possible to obtain the numerical value directly:

bc <- boxcox(Nit90 ~ 1, plotit = FALSE)
str(bc)
List of 2
 $ x: num [1:41] -2 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 ...
 $ y: num [1:41] -237 -228 -220 -212 -204 ...
bc$x[bc$y == max(bc$y)]
[1] 0.2

Note: these are approximate numbers, so it makes little sense to use more than one decimal place.

Box-Cox transformation III

It is also possible to estimate a common transformation for all groups at once by using explanatory variables (here group) on the right-hand side of the model formula:

dat <- read.csv("https://tpetzoldt.github.io/datasets/data/prk_nit.csv")
boxcox(biovol ~ group, data=dat)

Figure 1: Log-Likelihood profile of a Box-Cox-transformation for the pooled data sets Nit85 and Nit90.

Rank transformation

Example: Spearman correlation


Data set

x <- c(1, 2, 3, 5, 4, 5, 6, 7)
y <- c(1, 2, 4, 3, 4, 6, 8, 20)

Ranks

rank(x)
[1] 1.0 2.0 3.0 5.5 4.0 5.5 7.0 8.0
rank(y)
[1] 1.0 2.0 4.5 3.0 4.5 6.0 7.0 8.0

Two ways of calculation

cor(x, y, method = "spearman")
[1] 0.8915663
cor(rank(x), rank(y))
[1] 0.8915663




Confidence Intervals

Remember: The central limit theorem (CLT)

Sums of a large number \(n\) of independent and identically distributed random values are approximately normally distributed, independent of the type of the original distribution.

  • we can use methods that assume a normal distribution for non-normal data
    • if we have a large data set
    • if the original distribution is not “too skewed”
  • required number \(n\) depends on the skewness of the original distribution


Reason: Methods like t-test or ANOVA are based on mean values.

Confidence intervals of the mean


Standard error

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} \]

  • the variability of the mean is halved if we increase the sample size fourfold (\(2^2\))
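
A short simulation sketch (assuming \(\mu = 50\), \(\sigma = 10\)): the standard deviation of sample means matches \(s/\sqrt{n}\) and is roughly halved when \(n\) is quadrupled.

set.seed(1)
means10 <- replicate(1000, mean(rnorm(10, mean = 50, sd = 10)))
means40 <- replicate(1000, mean(rnorm(40, mean = 50, sd = 10)))
sd(means10)  # close to 10 / sqrt(10) = 3.16
sd(means40)  # close to 10 / sqrt(40) = 1.58, about half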

Estimation of the 95% confidence interval:

\[ CI_{95\%} = \bigg(\bar{x} - z_{0.975} \cdot \frac{s}{\sqrt{n}}, \bar{x} + z_{0.975} \cdot \frac{s}{\sqrt{n}}\bigg) \]

with \(z_{1-\alpha/2} = z_{0.975} =\) \(1.96\).

\(\rightarrow\) \(2\sigma\) rule

  • interval in which the true mean is found with 95% probability

Difference between sample and confidence intervals


  • prediction interval: characterizes the distribution of the data from the parameters of the sample (e.g. mean, standard deviation). It estimates the range where a single, future observation will likely fall.

  • standard deviation \(s_x\) measures the variability of the original data

  • reconstruct the original distribution if its type is known (e.g. normal, lognormal)


  • confidence interval: characterizes the precision of a statistical parameter, based on its standard error

  • Using \(\bar{x}\) and \(s_\bar{x}\), estimate the interval where we find \(\mu\) with a certain probability

  • less dependent on the original distribution of the data due to the CLT

Use the t-distribution for small samples

\[ CI_{95\%} = \bigg(\bar{x} - t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}, \bar{x} + t_{0.975, n-1} \cdot \frac{s}{\sqrt{n}}\bigg) \]

  • necessary for small samples: \(n\lessapprox 30\), \(n-1\) degrees of freedom
  • can also be used for \(n>30\)
  • the \(t\)-quantile can be found in tables or calculated with the qt() function in R.

Example with \(\mu=50\) and \(\sigma=10\):

set.seed(123)
n <- 10
x <- rnorm(n, 50, 10)
m <- mean(x); s <- sd(x)
se <- s/sqrt(n)
# lower and upper confidence limits
m + qt(c(0.025, 0.975), n-1) * se
[1] 43.92330 57.56922

\(\rightarrow\) the true mean (\(\mu\)=50) is in the interval CI = (43.9, 57.6).

Outliers

  • extremely large or extremely small values are sometimes called “outliers”
  • but potential outliers can be “extreme values” from a skewed distribution; excluding them can be scientific misconduct.
  • a “true” outlier is a value that is not from the population we want to analyze, e.g. a serious measurement error if someone forgot to add a chemical in an analysis.
  • it can also be something interesting, e.g. the result of a new phenomenon

\(\Rightarrow\) It can be wrong to exclude values only because they are “too big” or “too small”.

\(\rightarrow\) Try to find the reason, why values are extreme!


\(4 \sigma\)-rule

  • check whether a value is more than 4 standard deviations away from the mean value.
  • sample size should be \(n \ge 10\), \(\bar{x}\) and \(s\) are calculated without the potential outlier.
  • similar “rules of thumb” can be found in statistics textbooks.
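
A minimal sketch of this rule of thumb with hypothetical data (the mean and standard deviation are computed without the suspicious value):

x <- c(4.1, 4.4, 3.9, 4.0, 4.2, 4.3, 4.1, 3.8, 4.2, 4.0, 9.5)
m <- mean(x[-11])        # mean without the suspicious 11th value
s <- sd(x[-11])          # standard deviation without it
abs(x[11] - m) > 4 * s   # TRUE: more than 4 standard deviations away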

Outlier test for linear models with Bonferroni correction

  • For linear models and GLMs we can use the Bonferroni outlier test from package car.
library(car)
x <- c(rnorm(20), 12) # the 21st value (=12) is an outlier
outlierTest(lm(x~1))  # x ~ 1 is the null model
   rstudent unadjusted p-value Bonferroni p
21 11.66351         4.1822e-10   8.7826e-09

\(\rightarrow\) The 21st value is identified as an outlier.


Alternative to outlier tests

  • use robust parameters and methods,
    • e.g. median or trimmed mean instead of the arithmetic mean,
    • robust linear regression rlm instead of lm
    • rank-based methods like Spearman correlation
  • Important outliers may be omitted in an analysis, but the number and extent of omitted outliers must be mentioned!

Extreme values in boxplots


  • values are drawn outside the whiskers if they are more than 1.5 times the interquartile range (the width of the box) away from the box limits.

  • sometimes called “outliers”.

  • I prefer the term “extreme value”, because such points can be regular observations from a skewed or heavy-tailed distribution.

Example

par(mfrow=c(1, 3), las=1)
elbe <- read.csv("https://tpetzoldt.github.io/datasets/data/elbe.csv")
discharge <- elbe$discharge
boxplot(discharge, main="Boxplot of discharge")
hist(discharge)
hist(log(discharge - 70))

Discharge data of the Elbe River in Dresden in \(\mathrm m^3 s^{-1}\), data source: Bundesanstalt für Gewässerkunde (BFG), see terms and conditions.

  • left: large number of extreme values, are these outliers?
  • middle: distribution is right-skewed
  • right: transformation (3-parametric lognormal) \(\rightarrow\) symmetric distribution, no outliers!

More in the labs …


https://tpetzoldt.github.io/element-labs/