03-Statistical Parameters

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-09-16

Statistical Parameters


\(\rightarrow\) Remember: calculation of statistical parameters is called estimation

Properties of statistical parameters

  • Unbiasedness: the estimate shows no systematic deviation, i.e. its expected value equals the true value
  • Efficiency: a relatively small \(n\) is sufficient for a good estimation
  • Robustness: the estimation is not much influenced by outliers or certain violations of statistical assumptions

Depending on a particular question, different classes of parameters exist, especially measures of location (e.g. mean, median), variation (e.g. variance, standard deviation) or dependence (e.g. correlation).

Measures of location I

Arithmetic mean

\[ \bar{x} = \frac{1}{n} \cdot {\sum_{i=1}^n x_i} \]

Geometric mean

\[ G = \sqrt[n]{\prod_{i=1}^n x_i} \]

more practical: logarithmic form:

\[ G =\exp\Bigg(\frac{1}{n} \cdot {\sum_{i=1}^n \ln{x_i}}\Bigg) \]

avoids huge intermediate products that can cause numerical problems on the computer.
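A minimal sketch in R, using an arbitrary vector of positive example values, shows that both forms give the same result:

```r
x <- c(2.9, 7.9, 4.1, 8.8, 9.4, 0.5, 5.3)   # arbitrary positive example values
prod(x)^(1/length(x))                       # direct definition of the geometric mean
exp(mean(log(x)))                           # logarithmic form, same result
```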

Measures of location II

Harmonic mean

\[ \frac{1}{H}=\frac{1}{n}\cdot \sum_{i=1}^n \frac{1}{x_i}, \quad x_i>0 \]

Example:

You drive to the university at 50 km/h and return home at 100 km/h.
What is the mean velocity?

Result:

\[ H = \frac{1}{\frac{1}{2}\left(\frac{1}{50} + \frac{1}{100}\right)} = \frac{1}{\frac{0.02 + 0.01}{2}} = \frac{1}{0.015} = 66.67 \ \text{km/h} \]
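The same calculation in R, as a minimal sketch:

```r
v <- c(50, 100)        # velocities in km/h
H <- 1 / mean(1 / v)   # reciprocal of the mean of the reciprocals
H                      # 66.67 km/h
```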

Median (central value)


\(n\) odd: sort the data, take the middle value

\[\tilde{x} = x_{(n+1)/2}\]

\(n\) even: sort the data, take the average of the two middle values

\[\tilde{x} = \frac{x_{n/2}+x_{n/2+1}}{2}\]

Example

sample with 7 values: 2.9, 7.9, 4.1, 8.8, 9.4, 0.5, 5.3
ordered sample: 0.5, 2.9, 4.1, 5.3, 7.9, 8.8, 9.4

\(\Rightarrow\) median: \(\tilde{x} = 5.3\)

\(\Rightarrow\) mean: \(\bar{x} = 5.5571429\)
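In R, both measures can be computed directly; a minimal sketch with the sample above:

```r
x <- c(2.9, 7.9, 4.1, 8.8, 9.4, 0.5, 5.3)
sort(x)       # ordered sample
median(x)     # 5.3
mean(x)       # 5.557143
```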

Trimmed mean


  • also called “truncated mean”
  • a compromise between the arithmetic mean and the median
  • a certain percentage of the smallest and largest values (e.g. 10% or 25%) is ignored before calculating the arithmetic mean
  • also used in sports

Example: sample with 20 values, exclude 10% at both sides

0.4, 0.5, 1, 2.5, 2.9, 3.3, 4.1, 4.5, 4.6, 5.3, 5.5, 5.7, 6.8, 7.9, 8.8, 8.9, 9, 9.4, 9.6, 46

\(\rightarrow\) arithmetic mean: \(\bar{x}=7.335\)
\(\rightarrow\) trimmed mean: \(\bar{x}_{t, 0.1}=5.6375\)

  • median and trimmed mean are less influenced by outliers and skewness \(\rightarrow\) more robust
  • but somewhat less efficient
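A minimal sketch in R with the 20-value sample from above; `mean()` with the `trim` argument removes the given fraction of values at each end:

```r
x <- c(0.4, 0.5, 1, 2.5, 2.9, 3.3, 4.1, 4.5, 4.6, 5.3,
       5.5, 5.7, 6.8, 7.9, 8.8, 8.9, 9, 9.4, 9.6, 46)
mean(x)              # 7.335, pulled up by the extreme value 46
mean(x, trim = 0.1)  # 5.6375, 10% removed at each end
median(x)            # 5.4, even more robust
```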

Mode (modal value)

  • most frequent value of a sample
  • strict definition only valid for discrete (binary, nominal, ordinal) scales
  • extension to continuous scale: binning or density estimation

A first guess is the middle of the most frequent class.

Mode: weighting formula

\[\begin{aligned} D &= x_{lo}+\frac{f_k-f_{k-1}}{2f_k-f_{k-1}-f_{k+1}}\cdot w \\ D &= 18 + \frac{29 - 15}{2 \cdot 29 - 15 - 26} \cdot 2 = 19.65 \end{aligned}\]

\(f\): class frequency, \(w\): class width

\(k\): the index of the most abundant class, \(x_{lo}\) its lower limit.
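The arithmetic of the example can be checked with a few lines of R, using the class limits and frequencies from the formula above:

```r
x_lo <- 18                # lower limit of the most frequent class
w    <- 2                 # class width
f    <- c(15, 29, 26)     # frequencies of classes k-1, k, k+1
x_lo + (f[2] - f[1]) / (2 * f[2] - f[1] - f[3]) * w   # approx. 19.65
```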

Mode: density estimation

A somewhat more computation-intensive approach takes the mode as the maximum of a kernel density estimate.

The mode from the density estimate is then \(D=19.42\).
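A minimal sketch of this approach in R; the sample below is made up for illustration only, so its mode differs from the \(D=19.42\) of the lecture example:

```r
set.seed(123)
x <- rnorm(100, mean = 20, sd = 3)   # hypothetical sample
d <- density(x)                      # kernel density estimate
d$x[which.max(d$y)]                  # x value with maximal density = mode
```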

Multi-modal distribution

Example: a fish population with several age classes (cohorts)

Measures of variation

Variance

\[ s^2_x = \frac{SQ}{df}=\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} \]

  • \(SQ\): sum of squared differences from the mean \(\bar{x}\)
  • \(df = n-1\): degrees of freedom, \(n\): sample size

Standard deviation

\[s=\sqrt{s^2}\]

\(\rightarrow\) same unit as the mean \(\bar{x}\), so they can be directly compared.


In practice, \(s^2\) is often computed with:

\[ s^2_x = \frac{\sum{(x_i)^2}-(\sum{x_i})^2/n}{n-1} \]
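Both forms give the same result, as a small R sketch with arbitrary example values shows:

```r
x <- c(2.9, 7.9, 4.1, 8.8, 9.4, 0.5, 5.3)
var(x)                                 # variance with denominator n - 1
sd(x)                                  # standard deviation
n <- length(x)
(sum(x^2) - sum(x)^2 / n) / (n - 1)    # computational shortcut, same as var(x)
```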

Coefficient of variation (\(cv\))

Is the relative standard deviation:

\[ cv=\frac{s}{\bar{x}} \]

  • useful to compare variations of different variables, independent of their measurement unit
  • only applicable for data with ratio scale, i.e. with an absolute zero (like meters)
  • not for variables like Celsius temperature or pH.
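Base R has no dedicated function for the coefficient of variation, so it is usually computed directly; a minimal sketch with arbitrary ratio-scaled example values:

```r
x <- c(2.9, 7.9, 4.1, 8.8, 9.4, 0.5, 5.3)   # data on a ratio scale
cv <- sd(x) / mean(x)
cv
```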

Example

Assume we have the discharge of two rivers, one with \(cv=0.3\) and the other with \(cv=0.8\). Even without knowing the absolute discharges, we see directly that the second river shows the stronger relative variation.

Range


The range measures the difference between maximum and minimum of a sample:

\[ r_x = x_{max}-x_{min} \]


Disadvantage: very sensitive to outliers.
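In R, `range()` returns minimum and maximum; the sketch below uses the 20-value sample from the trimmed mean example to illustrate the outlier sensitivity:

```r
x <- c(0.4, 0.5, 1, 2.5, 2.9, 3.3, 4.1, 4.5, 4.6, 5.3,
       5.5, 5.7, 6.8, 7.9, 8.8, 8.9, 9, 9.4, 9.6, 46)
range(x)          # minimum and maximum
diff(range(x))    # 45.6, dominated by the single extreme value 46
```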

Interquartile range

  • the IQR (or \(I_{50}\)) omits the smallest and largest 25% of the values
  • sample size of at least 12 values recommended

\[ I_{50}=Q_3-Q_1=P_{75}-P_{25} \]

For the ordered sample:

  • \(Q_1\), \(Q_3\): 1st and 3rd quartiles
  • \(P_{25}, P_{75}\): 25th and 75th percentile
  • typically used in boxplots


For normally distributed samples, there is a fixed relationship between \(I_{50}\) and \(s\):

\[ \sigma = \frac{E(I_{50})}{2\,\Phi^{-1}(3/4)} \approx \frac{E(I_{50})}{1.349} \]

where \(\Phi^{-1}\) is the quantile function of the standard normal distribution.
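A minimal sketch in R; the check of the \(1.349\) relationship uses simulated standard normal data:

```r
x <- c(0.4, 0.5, 1, 2.5, 2.9, 3.3, 4.1, 4.5, 4.6, 5.3,
       5.5, 5.7, 6.8, 7.9, 8.8, 8.9, 9, 9.4, 9.6, 46)
quantile(x, c(0.25, 0.75))    # 1st and 3rd quartile
IQR(x)                        # interquartile range

set.seed(42)
y <- rnorm(1000)              # simulated standard normal sample
IQR(y) / (2 * qnorm(3/4))     # approximately 1, i.e. close to sd(y)
```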

Median absolute deviation


The median of the absolute differences between the values and their median:

\[ MAD = \text{median}(|x_i - \text{median}(x)|) \]

  • frequently used in some communities, rarely used in our field
  • some programs rescale the MAD with a factor of \(1.4826\) to approximate the standard deviation of normally distributed data.

\(\rightarrow\) Be careful and check the software docs!
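A short check in R: the `mad()` function rescales with 1.4826 by default, so `constant = 1` is needed to obtain the raw MAD as defined above:

```r
x <- c(2.9, 7.9, 4.1, 8.8, 9.4, 0.5, 5.3)
median(abs(x - median(x)))   # raw MAD as defined above
mad(x, constant = 1)         # same value
mad(x)                       # default: rescaled by 1.4826
```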

Standard error of the mean

\[ s_{\bar{x}}=\frac{s}{\sqrt{n}} \]

  • measures the accuracy of the mean
  • plays a central role for calculation of confidence intervals and statistical tests

Rule of thumb for sample sizes of about \(n > 30\):

  • “Two sigma rule”: with approximately 95% confidence, the true mean lies within \(\bar{x} \pm 2 s_\bar{x}\)
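A minimal sketch in R with a simulated sample of \(n = 50\) (hypothetical data, chosen only to satisfy the \(n > 30\) rule of thumb):

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)     # hypothetical sample
se <- sd(x) / sqrt(length(x))         # standard error of the mean
se
mean(x) + c(-2, 2) * se               # approximate 95% range for the true mean
```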

Important


  • standard deviation \(s\) measures variability of the sample
  • standard error \(s_\bar{x}\) measures accuracy of the mean

More about this will be explained in the next sections.