01-Introduction

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-09-16

Preface

Goals of the course


  1. Introduction to “Data Science”
  2. Statistical concepts and selected methods
    • Statistical parameters
    • Distributions and probability
    • Statistical tests
    • Model selection
  3. Practical experience
    • Data structures
    • Basics of the R language
    • Applications with real and simulated data sets

\(\Rightarrow\) Practical understanding and “statistical feeling”,

\(\rightarrow\) More important than facts learned by heart.

Topics


  1. Basic Concepts of Statistics
  2. An Introduction to R
  3. Statistical Parameters and Distributions
  4. Linear Models
  5. Analysis of Variance
  6. Nonlinear Regression
  7. Time Series Analysis
  8. (Multivariate Statistics)

Material


\(\rightarrow\) Slides and exercises are regularly updated, depending on the progress of the course. Comments are welcome.


  • Written exam at the end of the semester

    \(\rightarrow\) > 50% practical questions

    \(\rightarrow\) Attend the labs!


Questions?

Why Statistics?

\(\rightarrow\) a few examples before we begin

An introductory example

Daily average discharge of the River Elbe, gauge Dresden, river km 55.6

date,       discharge
1806-01-01,  472
1806-01-02, 1050
1806-01-03, 1310
1806-01-04, 1020
1806-01-05,  767
1806-01-06,  616
...
2020-10-11,  216
2020-10-12,  204
2020-10-13,  217
2020-10-14,  288
2020-10-15,  440
2020-10-16,  601
2020-10-17,  570
2020-10-18,  516
2020-10-19,  450
2020-10-20,  422
2020-10-21,  396
2020-10-22,  372
2020-10-23,  356
2020-10-24,  357
2020-10-25,  332
2020-10-26,  303
2020-10-27,  302
2020-10-28,  316
2020-10-29,  321
2020-10-30,  331
2020-10-31,  353
2020-11-01,  395

\(>\) 70,000 measurements. How can we analyse this and what does it mean?

Data Source: Bundesanstalt für Gewässerkunde
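
A minimal sketch of how such a table could be read into R. It uses only the first six rows of the excerpt above, embedded as text; the object names `csv_text` and `elbe` are placeholders chosen for illustration, and the real file name is not given here.

```r
# First six rows of the excerpt shown above, embedded as text so that
# the example is self-contained (the full data set is not included)
csv_text <- "date,       discharge
1806-01-01,  472
1806-01-02, 1050
1806-01-03, 1310
1806-01-04, 1020
1806-01-05,  767
1806-01-06,  616"

# strip.white = TRUE removes the padding blanks after the commas
elbe <- read.csv(text = csv_text, strip.white = TRUE)

# Convert the date column to the Date class
elbe$date <- as.Date(elbe$date)

str(elbe)   # shows the column types and the first values
```

With a real file, `read.csv("some_file.csv")` would replace the `text =` argument; the file name here is hypothetical.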

Plot the last 20 years

Discharge of the Elbe River, gauge station Dresden, data source BfG

What do these data tell us?

  • What is the average discharge? → mean values
  • How much variation is in the data? → variance
  • How likely are droughts or floods? → distribution
  • How precise are our forecasts? → confidence intervals
  • Which factors influence discharge? → correlations

How to start

  • Mean value: 224
  • Median value: 224
  • Standard deviation: 253
  • Range: 2, 4500

Which of these parameters are most appropriate?
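
A short sketch of how these parameters can be computed in R. It uses only the six values from January 1806 shown in the excerpt, so the results differ from the numbers quoted for the full series; the vector name `discharge` is chosen for illustration.

```r
# Six daily values from January 1806, copied from the excerpt above
discharge <- c(472, 1050, 1310, 1020, 767, 616)

mean(discharge)     # arithmetic mean
median(discharge)   # middle value, less sensitive to extremes
sd(discharge)       # standard deviation
range(discharge)    # minimum and maximum
```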

Graphics

Boxplots

  • Note the log scale of y!
  • In the right version, whiskers extend to the most extreme data point which is no more than 1.5 times the interquartile range from the box.
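
A sketch of how two such boxplot versions could be drawn in R. The data are simulated (an assumption, not the Elbe series), and it is also an assumption that the left version lets the whiskers span the full data range.

```r
# Simulated, right-skewed data standing in for daily discharge
# (assumption: log-normal values, not the real measurements)
set.seed(123)
x <- rlnorm(1000, meanlog = 5.5, sdlog = 0.6)

par(mfrow = c(1, 2))
# range = 0: whiskers extend to the minimum and maximum
boxplot(x, log = "y", range = 0, main = "range = 0")
# range = 1.5 (default): whiskers end at the most extreme point
# within 1.5 times the interquartile range from the box
boxplot(x, log = "y", range = 1.5, main = "range = 1.5")

# The same five-number summaries without plotting
boxplot.stats(x, coef = 0)$stats
boxplot.stats(x, coef = 1.5)$stats
```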

Three ways to work with statistics

Descriptive statistics and graphics

  • plots, like in the examples
  • mean values, standard deviations, …
  • interpret raw data

Hypothesis testing

  • distinguish effects from random fluctuations
  • make results more convincing

Statistical modelling

  • measure size of effects (e.g. climate trends)
  • build models that aggregate dependencies
  • machine learning

Statistical hypothesis testing


How likely is it that our hypothesis is true?


  • Turn the scientific hypothesis into a statistical hypothesis
  • Estimate the probability (p value) of observing data at least as extreme as ours if the null hypothesis were true

Examples

  • Is a medical treatment successful or not? → \(\chi^2\)-test
  • Does a specific food diet increase yield of a fish farm? → t-test
  • Which factors (e.g. food, temperature, pH) of a combined treatment influence growth of aquatic animals? → ANOVA
  • (How) does observed algal biomass depend on phosphorus?
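
As an illustration of the first example, a \(\chi^2\)-test on a 2 × 2 table might look like the sketch below. The counts are invented for illustration only and do not come from a real study.

```r
# Hypothetical counts (invented): rows = group, columns = outcome
counts <- matrix(c(30, 10,
                   18, 22),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group   = c("treated", "control"),
                                 outcome = c("recovered", "not recovered")))

# Null hypothesis: the outcome is independent of the group
res <- chisq.test(counts)
res$statistic   # test statistic
res$p.value     # a small value speaks against independence
```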

Statistical modelling


Fit a statistical model to the data

  • Select proper modelling strategy
  • Design statistical models
  • Measure effect size
  • Select the optimum model among different model candidates

Examples

  • Fit a distribution to annual discharge data to estimate the 100 year flood.
  • Fit an ANOVA model to experimental data to see which factors influence the result most.
  • Fit a multiple linear model to climate data to see how much climate trends differ between geographical locations.
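
A rough sketch of the first example. The annual maxima are simulated, and the log-normal distribution is only one assumed candidate; in practice other distributions (e.g. extreme value distributions) may fit better.

```r
# Simulated annual maximum discharges (assumption: invented values
# drawn from a log-normal distribution, not real measurements)
set.seed(42)
annual_max <- rlnorm(100, meanlog = 7.3, sdlog = 0.5)

# Fit a log-normal distribution via mean and sd of the log values
meanlog_hat <- mean(log(annual_max))
sdlog_hat   <- sd(log(annual_max))

# The 100 year flood is exceeded with probability 1/100 per year,
# i.e. it is the 0.99 quantile of the fitted distribution
q100 <- qlnorm(1 - 1/100, meanlog = meanlog_hat, sdlog = sdlog_hat)
q100
```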

Example: Compare two mean values

  • A given data set (Dobson, 1983) contains the birth weight (in g) of 12 boys and 12 girls.
  • Does the weight difference have something to do with the gender of the babies, or is it a purely random fluctuation?
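
A sketch of how two mean values can be compared with a t-test in R. The weights below are simulated with invented means and standard deviations; they are not the Dobson (1983) data.

```r
# Simulated birth weights in g (assumption: invented numbers,
# NOT the data set cited above)
set.seed(1)
boys  <- rnorm(12, mean = 3300, sd = 300)
girls <- rnorm(12, mean = 3100, sd = 300)

# Welch two-sample t-test (the default of t.test, which does not
# assume equal variances)
res <- t.test(boys, girls)
res$estimate   # the two sample means
res$p.value    # a small value speaks against "no difference"
```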

Example: Correlation and regression

  • Dependence of chlorophyll concentration in lakes on phosphorus, a regional data set from Koschel and Scheffler (1985) (left) and from Vollenweider and Kerekes (1980) (right).
  • Which of the two figures has greater predictive power? Why?
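
One way to quantify how closely points follow a regression line is the coefficient of determination \(r^2\). The sketch below uses simulated data with little and with much scatter; all values and variable names are assumptions and are unrelated to the two cited data sets.

```r
# Simulated data on a log scale (assumption: invented values)
set.seed(7)
log_p <- runif(30, min = 0.5, max = 2.5)                   # log phosphorus
log_chl_tight <- 0.9 * log_p - 0.3 + rnorm(30, sd = 0.1)   # little scatter
log_chl_loose <- 0.9 * log_p - 0.3 + rnorm(30, sd = 0.6)   # much scatter

fit_tight <- lm(log_chl_tight ~ log_p)
fit_loose <- lm(log_chl_loose ~ log_p)

# Share of the variance explained by the regression line
summary(fit_tight)$r.squared
summary(fit_loose)$r.squared
```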

How to do this in practice?

  1. Data
  2. Mathematics
  3. Computing

Which data structure is better?


Wide format


station  2021  2022  2023
A           4    20    13
B           4     4     3
C          18     7     6
D          12    17    10
E           2    19    11

Long format

year  station  value
2021  A            4
2021  B            4
2021  C           18
2021  D           12
2021  E            2
2022  A           20
2022  B            4
2022  C            7
2022  D           17
2022  E           19
2023  A           13
2023  B            3
2023  C            6
2023  D           10
2023  E           11
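
The two tables hold the same information. A minimal base R sketch of the conversion from wide to long is shown below; the object names `wide` and `long` are chosen for illustration. Packages such as tidyr offer `pivot_longer()` for the same task.

```r
# Wide format: one column per year
wide <- data.frame(
  station = c("A", "B", "C", "D", "E"),
  `2021`  = c(4, 4, 18, 12, 2),
  `2022`  = c(20, 4, 7, 17, 19),
  `2023`  = c(13, 3, 6, 10, 11),
  check.names = FALSE   # keep "2021" etc. as column names
)

years <- c("2021", "2022", "2023")

# Long format: one row per observation
long <- data.frame(
  year    = rep(as.integer(years), each = nrow(wide)),
  station = rep(wide$station, times = length(years)),
  value   = unlist(wide[years], use.names = FALSE)
)

long
```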

Example: An algae growth experiment


Wide format

Table 1: Growth of algae within 4 days (relative units)
treat       replicate 1  replicate 2  replicate 3
Fertilizer        0.020       -0.217       -0.273
F. open           0.940        0.780        0.555
F.+sugar          0.188       -0.100        0.020
F.+CaCO3          0.245        0.236        0.456
Bas.med.          0.699        0.727        0.656
A.dest           -0.010        0.000       -0.010
Tap water         0.030       -0.070           NA

  • NA means “not available”, i.e. a missing value
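
A short sketch of how R treats missing values, using the “Tap water” row of Table 1:

```r
# "Tap water" row of Table 1: the third replicate is missing
tap_water <- c(0.030, -0.070, NA)

mean(tap_water)                 # NA: the missing value propagates
mean(tap_water, na.rm = TRUE)   # mean of the two available values
sum(is.na(tap_water))           # number of missing values
```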

Data in long format


Advantages

  • looks “stupid” but is better for data analysis
  • dependent variable growth and
    explanatory variable treat clearly visible
  • model formula: growth ~ treat
  • easily extensible to \(>1\) explanatory variable

treat       rep  growth
Fertilizer    1   0.020
Fertilizer    2  -0.217
Fertilizer    3  -0.273
F. open       1   0.940
F. open       2   0.780
F. open       3   0.555
F.+sugar      1   0.188
F.+sugar      2  -0.100
F.+sugar      3   0.020
F.+CaCO3      1   0.245
F.+CaCO3      2   0.236
F.+CaCO3      3   0.456
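
With the data in long format, the model formula can be used directly. The sketch below enters the twelve rows shown above and fits a one-way ANOVA; the object names `algae` and `fit` are chosen for illustration.

```r
# The twelve rows of the long table above
algae <- data.frame(
  treat  = rep(c("Fertilizer", "F. open", "F.+sugar", "F.+CaCO3"), each = 3),
  rep    = rep(1:3, times = 4),
  growth = c( 0.020, -0.217, -0.273,
              0.940,  0.780,  0.555,
              0.188, -0.100,  0.020,
              0.245,  0.236,  0.456)
)

# Mean growth per treatment
aggregate(growth ~ treat, data = algae, FUN = mean)

# One-way ANOVA with the model formula growth ~ treat
fit <- aov(growth ~ treat, data = algae)
summary(fit)
```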

Why use long format?


Advantages

  • Clear and consistent:
    • avoids duplications
    • data structure easier to understand
  • Flexible:
    • for various statistical analyses, e.g. ANOVA, multiple regression, time series
    • easy to transform to wide formats when necessary
  • Compatible:
    • modern data analysis tools like R and Python prefer long format
    • compatible with database systems

Therefore:

  • Try to avoid wide format. It can lead to inconsistencies and complications in analysis.
  • Tidy data before doing the analysis and convert wide to long format.


\(\rightarrow\) Lab exercise with the Elbe River time series.

Mathematics


  1. Linear Algebra: The foundation for many statistical methods, especially matrices and vectors.
  2. Calculus: Optimization problems, deriving statistical formulas, understanding function behavior.
  3. Numerical Analysis: Implementation of statistical methods on computers, especially with large or complex datasets.
  4. Probability Theory: Sampling and modeling data, understanding statistical inference, developing algorithms.
  5. Statistical Modeling: Regression analysis, time series analysis, Bayesian modeling, machine learning.


\(\rightarrow\) Proper use of ready-made software packages requires fundamental understanding.

Computing

Required software

Why R?

  1. Statisticians call it the “lingua franca” of computational statistics.
    • Extremely powerful
    • No other system offers as many statistical methods
    • Used in statistical research
  2. Free (open source)
    • Free to use
    • Free to modify
    • Free to contribute
  3. Less complicated than it first appears:
    • Yes, it needs command line programming
    • but a single line can already do a lot
    • huge number of books and online scripts

In contrast to other systems, copy & paste is allowed! Just cite the source.

Books

Statistics

  • Well-readable introductions
    • Dalgaard, P., 2008: Introductory Statistics with R. Springer, New York, 2nd edition. (fulltext of the 1st edition freely available)
    • Verzani, J., 2019: Using R for Introductory Statistics. CRC Press.
  • A very well understandable introduction into many fields of statistics, especially regression and time series analysis:

R Programming

  • An introduction to data science using the modern “tidyverse” approach:
    • Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G., 2023: R for Data Science. Free ebook: https://r4ds.hadley.nz/