02-Basic Terminology

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-11-20

Basic Principles and Terminology

Goals of statistical analyses
Descriptive and experimental research
The principle of parsimony
Types of variables
Probability
Sample and Population
Random and systematic errors
Population and sample parameters

Goals of statistical analyses

Summarise, condense and describe data (descriptive statistics)
- work efficiently with large data sets
- Estimate statistical parameters, mean values, variation, correlation
Create hypotheses from data (explorative statistics)
- data mining and explorative statistics
- graphical methods, multivariate statistics
Test Hypotheses (statistical inference)
- classical tests, ANOVA, correlation, . . .
- model selection
Plan research (experimental design)
- effect size compared to random error
- experimental layout and required sample size
Statistical modelling
- measure effect size, find best explanation for a problem
- pattern recognition, forecasting, machine learning

Descriptive or experimental research

Descriptive Research

Find effects and relationships between data.
- observation, monitoring, correlations
- the research subject is not manipulated

Experimental Research

Can an expected effect be reproduced?
- manipulation of single conditions
- elimination of disturbances (controlled boundary conditions)
- experimental design as simple as possible

Strong inference requires clear hypothesis and experimental research.

Weak inference derived from observations and data.

\(\rightarrow\) descriptive research delivers the data for creating the hypotheses.

The principle of parsimony

Attributed to an English philosopher from the 14th century (“Occams razor”)

When you have two competing theories that make exactly the same predictions, the simpler one is the better.

In the context ofstatistical analysis and modeling:

models should have as few parameters as possible
linear models should be preferred to non-linear models
experiments should rely on only few assumptions
models should be simplified until they are minimal adequate
simple explanations should be preferred to complex explanations

One of the most important scientific principles

\(\rightarrow\) But nature is complex, over-simplification has to be avoided.

needs critical reflection and discussion

Variables and parameters

y = a + b \(\cdot\) x

variables: everything that is measured or experimentally manipulated, e.g phosphorus concentration in a lake, air temperature, or abundance of animals.
parameters: values that are estimated by a statistical model, e.g. mean, standard deviation, slope of a linear model.

Independent variables (explanation variables, predictors)

are manually controlled or assumed to result from non-controllable factors

Dependent variables (response variables, target variables, predicted variables)

the variables of interest that we try to understand.

Scales of variables

Binary (boolean variable): exactly two states: true/false, 1/0, present or absent.
Nominal: named entities, no order, {red, yellow, green}, list of species.
Ordinal variables (ranks, ordered factors): values or terms with an order {1., 2., 3., …}; {oligotrophic, mesotrophic, eutrophic, polytrophic, hypertrophic}, but not “dystrophic”
Metric: continuous (ideally without steps). Two sub-types:
- Interval scale: allows comparison and differences, but ratios make no sense. (20°C is 10 degrees warmer than als 10°C, but not double)
- Ratio scale: data with an absolute zero, ratios make sense.
  A tree with 2m has double the hight of a tree with 1m.

The “level” of variables increases from binary to ratio scale. It is always possible to convert a higher to a lower level.

Transformation of scales

The “level” of variables increases from binary to ratio scale. It is always possible to convert a higher to a lower level scale:

metric \(\rightarrow\) ordinal: ranking
metric or ordinal \(\rightarrow\) binary: threshold
nominal \(\rightarrow\) binary: assign to two groups

Transformation to a lower scale results in a certain amount of information loss, but allows to use additional methods from the lower-level scale.

Explanation: If we apply rank correlation to metric data, we essentially apply a method for the ordinal scale to metric data. In this case, we loose information about the differences between the values, but also decrease influence of extreme values and outliers.

Transformation from metric to binary can be useful, if the metric data are not precise enough. So for example, counting animals (e.g. wolves) in a certain area may depend on too many factors (structure of the landscape, experience of people, season etc.) so that the exact numbers (abundances) are questionable. In such cases, transformation to a binary scale (present/absent) and using a respective test (e.g. logistic regression or Fisher’s exact test) will be more reliable.

Other examples are the comparison of floods between different rivers, e.g. a large and a small ones, or occurrences of genes in a molecular biological analysis.

Probability

Classical definition

probability \(p\) is the chance of a specific event:

\[ p = \frac{\text{number of selected cases}}{\text{number of all possible cases}} \]

1 or 6 on a dice \(p=2/6\)
problem if denominator becomes infinite

Axiomatic definition

Axiom I: \(0 \le p \le 1\)
Axiom II: impossible events have \(p=0\), safe events have \(p=1\)
Axiom III: for mutually exlusive events \(A\) and \(B\), i.e. in set theory \(A \bigcap B = \emptyset\) holds: \(p(A \bigcup B)= p(A) + p(B)\)

Sample and Population

Sample

Subjects, from which we have measurements or observations

Population

Set of all subjects that had the same chance to become part of the sample.

\(\Rightarrow\) The population is defined by the way how samples are taken

\(\Rightarrow\) Samples should be representative for our intended observational subject.

Sampling strategies

Random sampling

Individuals are selected at random from a given population.
Examples:
- Random selection of sample sites on a grid.
- Random placement of experimental units on a shelf.

Stratified sampling

The population is subdivided into classes of similar subjects (strata).
The strata are separately analysed and then the the information is weighted and combined to infer about the population.
Stratified sampling requires information about the size and representativity of the strata.
Examples: election forecasts, depth layers in a lake, age classes for animals.

Random and systematic errors

Random errors

can be estimated with statistical methods
are eliminated if sample size is large
in large samples, big and small errors average out

Systematic Errors also called bias

can often not easily be estimated with statistical methods alone
knowledge about the considered system
elimination requires calibration using standards, blind values or pairing

Population and sample parameters

“True” parameters of the population

symbolized with greek letters, (\(\mu, \sigma, \gamma\, \alpha, \beta\))
usually unknown
estimated from a sample

“Calculated” parameters from a sample

symbolized with latin letters (\(\bar{x}\), \(s\), \(r^2\), …)
the calculation is done from a sample
statisticians say “estimation” instead of “calculation”
parameters can themselves be treated as a random variable

Expected value

A single measurement \(x_i\) of a random variable \(X\) can be written as the sum of the expected value \(\mathbf{E}(X)\) of the random variable and a random error \(\varepsilon_i\).

\[\begin{align} x_i &= \mathbf{E}(X) + \varepsilon_i\\ \mathbf{E}(\varepsilon)&=0 \end{align}\]

Example:

for a fair dice with 6 eyes, true mean \(\mu\) should be 3.5
in reality it is not exactly known if the dice is a perfect cubus

Example: 3 people with 5 trials:

sample 1:  3 3 2 4 1  mean: 2.6

sample 2:  6 1 1 6 1  mean: 3

sample 3:  6 5 6 6 5  mean: 5.6

Overall mean: \(\bar{x} = 3.73\) is close to \(\mu = 3.5\).