x4-Pipelines in R

Applied Statistics – A Practical Course

Thomas Petzoldt

2025-11-20

Prerequisites

The examples in slis slide require the following R packages: packages, that must be installed and loaded.

Installation

install.packages(c("dplyr", "tiryr", "lubridate", "readxl", "ggplot2"))

Loading

library("dplyr")
library("tidyr")
library("lubridate")
library("readxl")
library("ggplot2")

The examples were tested with R 4.3.1 and RStudio 2023.06.1

An introductory example

In a statistics course, two samples of Maple (Acer platanoides) leaves were collected by two groups of students:
group HSE: had the freedom to collect leaves individually from trees close to the institute
group HYB: got a random sample from the supervisor

Hypothesis: sampling bias may affect statistical parameters, especially mean and variance.

Download the file leaves.csv, save it to a working directory and then read it with read.csv:

leaves <- read.csv("leaves.csv")

Have a look at the data:

head(leaves)

  group no length width stalk
1   HSE  1     83    87    74
2   HSE  2    130   153   105
3   HSE  3    140   148   135
4   HSE  4    102   110    94
5   HSE  5    190   151    89
6   HSE  6    225   139    91

A boxplot

boxplot(width ~ group, data=leaves)

Summary statistics

summary(leaves)

    group                 no             length           width      
 Length:126         Min.   :  1.00   Min.   : 37.00   Min.   : 44.0  
 Class :character   1st Qu.: 32.25   1st Qu.: 72.00   1st Qu.: 96.0  
 Mode  :character   Median : 63.50   Median : 90.00   Median :118.5  
                    Mean   : 63.50   Mean   : 95.83   Mean   :117.4  
                    3rd Qu.: 94.75   3rd Qu.:102.00   3rd Qu.:138.5  
                    Max.   :126.00   Max.   :250.00   Max.   :199.0  
     stalk       
 Min.   : 29.00  
 1st Qu.: 62.00  
 Median : 82.00  
 Mean   : 81.89  
 3rd Qu.:101.00  
 Max.   :175.00

Summary statistics per group

different ways to calculate summary statistics
classical method with aggregate

A few examples

aggregate(cbind(length, width, stalk) ~ group, mean, data=leaves)

  group    length    width     stalk
1   HSE 137.82857 139.0286 102.02857
2   HYB  79.67033 109.1319  74.14286

aggregate(cbind(length, width, stalk) ~ group, sd, data=leaves)

  group   length    width    stalk
1   HSE 47.99356 26.07285 26.47917
2   HYB 19.93771 30.90423 25.24087

aggregate(cbind(length, width, stalk) ~ group, min, data=leaves)

  group length width stalk
1   HSE     83    87    52
2   HYB     37    44    29

aggregate is very powerful, but the modern “tidyverse” approach is easier to understand. This is explained in the following.

Summary statistics with the `dplyr`-package

Package dplyr contains two handy functions:

group_by
summarize

The functions can be combined in different ways:

A) Two separate code lines

leaves_grouped <- group_by(leaves, group)
summarize(leaves_grouped, mean = mean(width), sd = sd(width), min = min(width), max = max(width))

\(\ominus\) needs a temporary variable: leaves_grouped

B) One line, group_by enclosed in parentheses

summarize(group_by(leaves, group), mean=mean(width), sd=sd(width), min=min(width), max=max(width))

\(\oplus\) no temporary variables necessary
\(\ominus\) nested parentheses

More streamlined: pipelines

Aim to make code easier to understand:

avoids nested parentheses
avoids temporary variables

Recent versions of R (since 4.1) have built-in support for pipelines:

Native pipeline operator |>

A predecessor was the so-called “magrittr” pipeline operator %>%

Most examples from these slides will work with both pipeline operators.
There are a few differences regarding use of so-called placeholders.
I recommended to prefer native pipes |>

Application of `|>`

The output from the first function is piped to the next

group_by(leaves, group) |>
summarize(mean = mean(width), sd = sd(width), 
          min = min(width), max = max(width))

Or, even more streamlined: start pipeline with the data frame

leaves |>
  group_by(group) |>
  summarize(mean = mean(width), sd = sd(width), 
            min = min(width), max = max(width))

# A tibble: 2 × 5
  group  mean    sd   min   max
  <chr> <dbl> <dbl> <int> <int>
1 HSE    139.  26.1    87   195
2 HYB    109.  30.9    44   199

How it works

The pipe operator |> inserts the output from one function
into the first argument of the next function.

Classical functional style

group_by(leaves, group)

Pipeline style

leaves |> group_by(group)

Summary statistics for all variables

The leaves dataset contains different variables length, width and stalk length.
We can now, in principle, extend summarize:

Add more code rows

leaves |>
  group_by(group) |>
  summarize(mean_l=mean(width),  sd_l=sd(width),  min_l=min(width),  max_l=max(width),
            mean_w=mean(length), sd_w=sd(length), min_w=min(length), max_w=max(length),
            mean_s=mean(stalk),  sd_s=sd(stalk),  min_s=min(stalk),  max_s=max(stalk)
  )

Is copy and paste a good idea?

No, at least not in excess.

Copy and paste can lead to errors.
… and there are more compact and elegant ways.

Tidy your data and use “long” data formats!

Long data formats are more database like and more flexible.

If you are used to working with LibeOffiice or Excel, you will probably prefer “wide” tables that fit well on the computer screen. However, this is not such a good idea for data bases and scripted data science.

Modern data analysis packages like dplyr and ggplot2 mandatorily require the long format.

Long data format (= tidy format)

Put data from all 3 variables in one column: length, width, stalk \(\rightarrow\) value
Identifier column for the variables: name

Wide format

group	no	length	width	stalk
HSE	1	83	87	74
HSE	2	130	153	105
HSE	3	140	148	135
HSE	4	102	110	94
HSE	5	190	151	89
HSE	6	225	139	91
HSE	7	195	165	76
HSE	8	216	135	113
HSE	9	250	195	119
HSE	10	152	168	158

Long format

group	no	name	value
HSE	1	length	83
HSE	1	width	87
HSE	1	stalk	74
HSE	2	length	130
HSE	2	width	153
HSE	2	stalk	105
HSE	3	length	140
HSE	3	width	148
HSE	3	stalk	135
HSE	4	length	102

… … …

Long data format with `pivot_longer`

leaves |> 
  pivot_longer(c("length", "width", "stalk"))

# A tibble: 378 × 4
   group    no name   value
   <chr> <int> <chr>  <int>
 1 HSE       1 length    83
 2 HSE       1 width     87
 3 HSE       1 stalk     74
 4 HSE       2 length   130
 5 HSE       2 width    153
 6 HSE       2 stalk    105
 7 HSE       3 length   140
 8 HSE       3 width    148
 9 HSE       3 stalk    135
10 HSE       4 length   102
# ℹ 368 more rows

Summary statistics for all variables groupwise

leaves |> 
  pivot_longer(c("length", "width", "stalk")) |>
  group_by(group, name) |>
  summarize(mean = mean(value), 
            sd   = sd(value), 
            min  = min(value),
            max  = max(value))

# A tibble: 6 × 6
# Groups:   group [2]
  group name    mean    sd   min   max
  <chr> <chr>  <dbl> <dbl> <int> <int>
1 HSE   length 138.   48.0    83   250
2 HSE   stalk  102.   26.5    52   175
3 HSE   width  139.   26.1    87   195
4 HYB   length  79.7  19.9    37   124
5 HYB   stalk   74.1  25.2    29   144
6 HYB   width  109.   30.9    44   199

Pipes and the assignment operator `<-`

In the examples before, the pipe-output was directly printed to the screen
If we need the result in a subsequent operation, we assign it to a variable as usual with <-

totals <- 
  leaves |> 
  pivot_longer(c("length", "width", "stalk")) |>
  group_by(group, name) |>
  summarize(mean=mean(value), sd=sd(value), min=min(value), max=max(value))

Don’t get confused!

the pipe starts with leaves in the second code line
the direction of the pipeline is from left \(\rightarrow\) right
then the of the complete pipeline is assigned to totals

It follows the convention, that the result of an equation is assigned from right to the left.

Reverse assignment?

“More logical”, but less common would be a consequent left to the right notation with ->

leaves |> 
  pivot_longer(c("length", "width", "stalk")) |>
  group_by(group, name) |>
  summarize(mean=mean(value), sd=sd(value), min=min(value), max=max(value)) ->
  totals

Amelia McNamara used this style in her keynote talk at the 2020 use!R conference about Speaking R on youtube.

But, Headley Wickham discouraged this style, because the -> breaks with mathematical convention and is difficult to spot in the code.

Indentation

totals <- 
  leaves |> 
  pivot_longer(c("length", "width", "stalk")) |>
  group_by(group, name) |>
  summarize(mean=mean(value), sd=sd(value), min=min(value), max=max(value))

The pipeline above shows essentially one single line of code.

To improve readability, code lines should not be longer than 80 characters.
Remember: Line breaks can be at any position, as long as a code line is not complete.
Common style: make a newline after <- and |> and use 2 characters for indentation.

Pipelines, tidy data and plotting

The Clementine orange data set

Clementine orange data set

Samples of clementine oranges, measured, weighed and consumed in a statistic course.
Excel file with two tables:
- long table with the fruits
- shorter table brands with meta data
Data can be downloaded from here.

Read data directly from Excel file

brands  <- read_excel("clementines_2019.xlsx", "Brands")
fruits  <- read_excel("clementines_2019.xlsx", "Fruits")

Clementine orange data set

Table “fruits”

year	brand	weight	width	height
2019	EB	107	61.0	54
2019	EB	100	65.0	55
2019	EB	89	58.0	49
2019	EB	99	62.0	52
2019	EB	99	64.0	58
2019	EB	96	63.0	59
2019	EB	100	54.0	54
2019	EB	89	58.5	47
2019	EB	86	66.0	48
2019	EB	92	43.0	33
2019	EB	102	48.0	33
2019	EB	92	43.0	32

Table “brands”

year	brand	type	kilogram	price
2019	EB	Basic	1.50	2.99
2019	EP	Gold	0.75	1.49
2019	LB	Basic	1.00	1.49
2019	LO	Bio	0.50	1.49
2019	NX	Box	2.30	2.99
2019	NP	Premium	1.00	2.29
2019	NB	Basic	1.00	1.49
2019	NC	Basic	1.00	1.49

Database join

join two tables to bring the information together
In case of left_join, the (larger) main table is at the left.
The tables have two key fields in common: year and brand.
The key fields can be automatically detected or explicitly specified or renamed
\(\rightarrow\) help page of left_join

fruits2 <- left_join(fruits, brands)

no	year	brand	weight	width	height	taste	shop	type	quality	kilogram	price
1	2019	EB	107	61	54	6	Edeka	Basic	Basic	1.5	2.99
2	2019	EB	100	65	55	7	Edeka	Basic	Basic	1.5	2.99
3	2019	EB	89	58	49	10	Edeka	Basic	Basic	1.5	2.99
4	2019	EB	99	62	52	7	Edeka	Basic	Basic	1.5	2.99
5	2019	EB	99	64	58	8	Edeka	Basic	Basic	1.5	2.99
6	2019	EB	96	63	59	8	Edeka	Basic	Basic	1.5	2.99
7	2019	EB	100	54	54	7	Edeka	Basic	Basic	1.5	2.99

Exercise

calculate summary statistics for a single data column (e.g. weight)
re-organize the tables with pivot_longer so that all measurements are in one column
calculate summary statistics for all measurements

Selection of columns and rows: `select` and `filter`

select: select columns
filter: filters rows

fruits2 |>
  select(brand, shop, type, weight, width) |>
  filter(shop %in% c("Edeka", "Lidl"))

# A tibble: 44 × 5
   brand shop  type  weight width
   <chr> <chr> <chr>  <dbl> <dbl>
 1 EB    Edeka Basic    107  61  
 2 EB    Edeka Basic    100  65  
 3 EB    Edeka Basic     89  58  
 4 EB    Edeka Basic     99  62  
 5 EB    Edeka Basic     99  64  
 6 EB    Edeka Basic     96  63  
 7 EB    Edeka Basic    100  54  
 8 EB    Edeka Basic     89  58.5
 9 EB    Edeka Basic     86  66  
10 EB    Edeka Basic     92  43  
# ℹ 34 more rows

Create or transform columns: `mutate`

Example

Transform a variable, e.g. weight by \(x^{1/3}\) into a theoretical mean diameter

fruits2 <- 
  fruits2 |>
  mutate(L_mean = weight^(1/3))

Show results

Classical boxplot

boxplot(L_mean ~ brand, data=fruits2)

Plot in “tidyverse”-style with pipes and `ggplot`

fruits2 |>
  mutate(L_mean = weight^(1/3)) |>
  ggplot(aes(brand, L_mean)) + geom_boxplot()

Another `mutate` example

Let’s compare the measured weight of our fruits with a “theoretical volume” calculated from length and height using the formula of an ellipsoid. This is of course an approximation:

\[ V = 4/3 \pi \cdot \rm (length/2)^2 \cdot height/2 \]

fruits <-
  fruits |>
  mutate(V = 0.001 * 4/3 * pi * (width/2)^2 * height/2, index = weight / V)

library(ggplot2)
fruits |>
  ggplot(aes(weight, index)) + 
  geom_point()

The “+” operator in ggplot looks like a pipeline, but works differently.

It adds elements to a plot.

Color coded points

library(ggplot2)
fruits |> ggplot(aes(weight, index, color=brand)) + geom_point()

Categorial split (faceting) and regression line

fruits |> 
  ggplot(aes(weight, V)) + 
  geom_point() + 
  geom_smooth(method=lm, se=FALSE) + 
  facet_wrap( ~ brand)

Modify font size

fruits |> 
  ggplot(aes(weight, V)) + 
  geom_point() + 
  geom_smooth(method=lm, se=FALSE) + 
  facet_wrap( ~ brand) +
  theme(text = element_text(size=24))

\(\rightarrow\) themes allow to configure “almost everything” …

Discharge of the Elbe River

Elbe River in Dresden 2006-04-01

Read data to R

The example file elbe.csv contains daily discharge of the Elbe River in \(\mathrm{m^3 s^{-1}}\) from gauging station Dresden, river km 55.6. The data are from the Federal Waterways and Shipping Administration (WSV) and where provided by the Federal Institute for Hydrology (BfG).

We can skip downloading and read the file directly from its internet location:

elbe <- read.csv("https://raw.githubusercontent.com/tpetzoldt/datasets/main/data/elbe.csv")

The third column “validated” indicate whether the values were finally approved by WSV and BfG. Data of the 19th century are particularly uncertain. Please consult the file elbe_info.txt for details.

Date and time conversion

Now, let’s extend the elbe data frame by adding information about the day, month, year and day of year. Here function mutate adds additional columns.

Note also that the day of year function in the date and time package lubridate is named yday.

Details about date and time conversion can be found in the lubridate cheatsheet.

library(lubridate) # a tidyverse package for dates
elbe <- mutate(elbe,
               date  = as.POSIXct(date),
               day   = day(date), 
               month = month(date), 
               year  = year(date), 
               doy   = yday(date))

Inspect data structure

If we work with RStudio, may have a look at the “Global Environment” pane and inspect the data structure of the elbe data frame.

date	discharge	validated	day	month	year	doy
1989-01-01	765	TRUE	1	1	1989	1
1989-01-02	713	TRUE	2	1	1989	2
1989-01-03	684	TRUE	3	1	1989	3
1989-01-04	612	TRUE	4	1	1989	4
1989-01-05	565	TRUE	5	1	1989	5
1989-01-06	519	TRUE	6	1	1989	6
1989-01-07	522	TRUE	7	1	1989	7
1989-01-08	524	TRUE	8	1	1989	8
1989-01-09	544	TRUE	9	1	1989	9
1989-01-10	539	TRUE	10	1	1989	10
1989-01-11	606	TRUE	11	1	1989	11
1989-01-12	606	TRUE	12	1	1989	12

Annual summary statistics

Summarize data

## calculate annual mean, minimum, maximum
totals <- elbe |>
  group_by(year) |>
  summarize(mean = mean(discharge), 
            min = min(discharge), 
            max = max(discharge))

Show table of summary statistics

head(totals)

# A tibble: 6 × 4
   year  mean   min   max
  <dbl> <dbl> <dbl> <dbl>
1  1989  268.   122   765
2  1990  217.    89   885
3  1991  189.    97   634
4  1992  267.    89  1090
5  1993  256.    92  1610
6  1994  317.    92  1030

Exercise: Compute monthly discharge mean values and monthly sums.

More about pivot tables

In a section before, we already used pivot_longer to reorganize data. Now we do the opposite and convert a data base table (long data format) into a cross-table (wide data format) and vice versa.
R provides several function pairs for this, so you may see functions like melt and cast or gather and spread.
Recently the two functions pivot_wider and pivot_longer were recommended for this purpose.
- The first argument is a data base table, the other arguments define the structure of the desired crosstable.
- id_cols is the name of a column in a long table that will become the rows
- names_from indicates where the names of the columns are taken from
- values_from is the column with the values for the cross table.

\(\rightarrow\) If more than one value exists for a row x column combination, an optional aggregation function values_fn can be given.

Crosstable with one column per year

elbe_wide <-  elbe |>
  pivot_wider(id_cols = doy, 
              names_from = year, 
              values_from = discharge, 
              values_fn = mean)
elbe_wide

Exercise

Create a suitable crosstable elbe_wide.
Then create a crosstable for monthly maximum discharge over all years.

Back-conversion of a crosstable into a data base table

The inverse case is also possible, e.g. the conversion of a cross table into a data base table. It can be done with the function pivot_longer. The column of the id.vars variable(s) will become identifier(s) downwards.

pivot_longer(elbe_wide, names_to="year", cols=as.character(1989:2019))

Exercise

Try it yourself.

Minimum-Maximum plot with summarize and ggplot2

elbe |> 
  mutate(doy = yday(date)) |>
  group_by(doy) |>
  summarize(max = max(discharge), 
            mean = mean(discharge), 
            min = min(discharge)) |>
  pivot_longer(cols = c("min", "mean", "max"), 
               names_to = "statistic", 
               values_to = "discharge") |>
  ggplot(aes(doy, discharge, color = statistic)) + geom_line()

Exercise

Read the code and try to understand it. Then add a dry and a wet year.

Cumulative sums

Annual cumulative sum plots are a hydrological standard tool used by reservoir managers. We can use the R function cumsum, that by successive cumulation converts a sequence of:

\(x_1, x_2, x_3, x_4, \dots\)

into

\((x_1), (x_1+x_2), (x_1+x_2+x_3), (x_1+x_2+x_3+x_4), \dots\)

Example

x <- c(1, 3, 2, 6, 4, 2, 3)
cumsum(x)

[1]  1  4  6 12 16 18 21

Cumulative sums of the Elbe River

Cummulative sums allow to detect dry and wet years, or periods within years.

If we just use cumsum, we get a cumulative sum for all years:

elbe |> 
  mutate(doy = yday(date), year = year(date)) |>
  filter(year %in% 2000:2010) |>
  group_by(year = factor(year)) |>
  mutate(cum_discharge = cumsum(discharge) * 60*60*24) |>
  ggplot(aes(doy, cum_discharge, color = year)) + geom_line()

The multiplication with \(60 \cdot 60 \cdot 24\) converts \(\rm m^3 s^{-1}\) in \(\rm m^3 d^{-1}\).

Exercises

Repeat the same for other time periods (years).
Which year was the wettest, which one the driest year in total? Find a year with dry spring and wet summer.
Identify some (e.g. 3 or 5) large floods in the historical time series and plot it together.
Modify the commands so that the hydrological year is shown. Note that the German hydrological year goes from 1st November to 31st October of the following year. Other countries have different regulations.

References

Bache, S. M., & Wickham, H. (2022). Magrittr: A forward-pipe operator for r. https://CRAN.R-project.org/package=magrittr

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science (2nd ed.). O’Reiley. https://r4ds.hadley.nz/

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr

group	no	length	width	stalk
HSE	1	83	87	74
HSE	2	130	153	105
HSE	3	140	148	135
HSE	4	102	110	94
HSE	5	190	151	89
HSE	6	225	139	91
HSE	7	195	165	76
HSE	8	216	135	113
HSE	9	250	195	119
HSE	10	152	168	158

group	no	length	width	stalk
HSE	1	83	87	74
HSE	2	130	153	105
HSE	3	140	148	135
HSE	4	102	110	94
HSE	5	190	151	89
HSE	6	225	139	91
HSE	7	195	165	76
HSE	8	216	135	113
HSE	9	250	195	119
HSE	10	152	168	158

x4-Pipelines in R

Prerequisites

An introductory example

A boxplot

Summary statistics

Summary statistics per group

Summary statistics with the dplyr-package

More streamlined: pipelines

Application of |>

How it works

Summary statistics for all variables

Tidy your data and use “long” data formats!

Long data format (= tidy format)

Long data format with pivot_longer

Summary statistics for all variables groupwise

Pipes and the assignment operator <-

Reverse assignment?

Indentation

Pipelines, tidy data and plotting

The Clementine orange data set

Clementine orange data set

Clementine orange data set

Database join

Exercise

Selection of columns and rows: select and filter

Create or transform columns: mutate

Plot in “tidyverse”-style with pipes and ggplot

Another mutate example

Color coded points

Categorial split (faceting) and regression line

Modify font size

Discharge of the Elbe River

Read data to R

Date and time conversion

Inspect data structure

Annual summary statistics

More about pivot tables

Crosstable with one column per year

Back-conversion of a crosstable into a data base table

Minimum-Maximum plot with summarize and ggplot2

Cumulative sums

Cumulative sums of the Elbe River

Exercises

Further reading

References

Summary statistics with the `dplyr`-package

Application of `|>`

Long data format with `pivot_longer`

Pipes and the assignment operator `<-`

Selection of columns and rows: `select` and `filter`

Create or transform columns: `mutate`

Plot in “tidyverse”-style with pipes and `ggplot`

Another `mutate` example

group	no	length	width	stalk
HSE	1	83	87	74
HSE	2	130	153	105
HSE	3	140	148	135
HSE	4	102	110	94
HSE	5	190	151	89
HSE	6	225	139	91
HSE	7	195	165	76
HSE	8	216	135	113
HSE	9	250	195	119
HSE	10	152	168	158