Pipes

library(tidyverse)

Pipes are a way to order your code in a more readable format.

Let’s say you have a small data table with 10 participant IDs, two columns with variable type A, and two columns with variable type B. You want to calculate the mean of the A variables and the mean of the B variables and return a table with 10 rows (1 for each participant) and 3 columns (id, A_mean, and B_mean).

One way you could do this is by creating a new object at every step and using that object in the next step. This is pretty clear, but you’ve created 6 unnecessary data objects in your environment. This can get confusing in very long scripts.

# make a data table with 10 subjects
data_original <- tibble(
  id = 1:10,
  A1 = rnorm(10, 0),
  A2 = rnorm(10, 1),
  B1 = rnorm(10, 2),
  B2 = rnorm(10, 3)
)

# gather columns A1 to B2 into "variable" and "value" columns
data_gathered <- gather(data_original, variable, value, A1:B2)

# separate the variable column after the first character into "var" and "var_n" columns
data_separated <- separate(data_gathered, variable, c("var", "var_n"), sep = 1)

# group the data by id and var
data_grouped <- group_by(data_separated, id, var)

# calculate the mean value for each id/var 
data_summarised <- summarise(data_grouped, mean = mean(value))
## `summarise()` regrouping output by 'id' (override with `.groups` argument)
# spread the mean column into A and B columns
data_spread <- spread(data_summarised, var, mean)

# rename A and B to A_mean and B_mean
data <- rename(data_spread, A_mean = A, B_mean = B)

 id     A_mean   B_mean
  1  0.2304382 3.376790
  2 -0.2460394 3.150984
  3  0.4908918 2.020612
  4 -0.3616511 2.706038
  5 -0.5155883 2.599207
  6 -0.0474370 1.913186
  7  0.1864482 2.065799
  8  0.5501416 2.405596
  9  0.2093015 1.425043
 10 -0.7423514 2.399520

You can name each object data and keep replacing the old data object with the new one at each step. This will keep your environment clean, but I don’t recommend it because it makes it too easy to accidentally run your code out of order when you are running line-by-line for development or debugging.
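If you want to see what that looks like, here is a sketch of the same steps written in that style (again, not recommended):

# same steps, overwriting data at each stage (easy to run out of order)
data <- tibble(
  id = 1:10,
  A1 = rnorm(10, 0),
  A2 = rnorm(10, 1),
  B1 = rnorm(10, 2),
  B2 = rnorm(10, 3)
)
data <- gather(data, variable, value, A1:B2)
data <- separate(data, variable, c("var", "var_n"), sep = 1)
data <- group_by(data, id, var)
data <- summarise(data, mean = mean(value))
data <- spread(data, var, mean)
data <- rename(data, A_mean = A, B_mean = B)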

One way to avoid extra objects is to nest your functions, literally replacing each data object with the code that generated it in the previous step. This can be fine for very short chains.

mean_petal_width <- round(mean(iris$Petal.Width), 2)

But it gets extremely confusing for long chains:

# do not ever do this!!
data <- rename(
  spread(
    summarise(
      group_by(
        separate(
          gather(
            tibble(
              id = 1:10,
              A1 = rnorm(10, 0),
              A2 = rnorm(10, 1),
              B1 = rnorm(10, 2),
              B2 = rnorm(10, 3)), 
            variable, value, A1:B2), 
          variable, c("var", "var_n"), sep = 1), 
        id, var), 
      mean = mean(value)), 
    var, mean), 
  A_mean = A, B_mean = B)
## `summarise()` regrouping output by 'id' (override with `.groups` argument)

The pipe lets you “pipe” the result of each function into the next function, allowing you to put your code in a logical order without creating too many extra objects.
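For instance, the mean_petal_width calculation from the nested example above can be rewritten as a pipe (a minimal sketch; %>% is the pipe loaded with the tidyverse):

mean_petal_width <- iris$Petal.Width %>% mean() %>% round(2)

Applying the same idea to the longer example: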

# calculate mean of A and B variables for each participant
data <- tibble(
  id = 1:10,
  A1 = rnorm(10, 0),
  A2 = rnorm(10, 1),
  B1 = rnorm(10, 2),
  B2 = rnorm(10, 3)
) %>%
  gather(variable, value, A1:B2) %>%
  separate(variable, c("var", "var_n"), sep = 1) %>%
  group_by(id, var) %>%
  summarise(mean = mean(value)) %>%
  spread(var, mean) %>%
  rename(A_mean = A, B_mean = B)
## `summarise()` regrouping output by 'id' (override with `.groups` argument)

You can read this code from top to bottom as follows:

  1. Make a tibble called data with
    • id of 1 to 10,
    • A1 of 10 random numbers from a normal distribution with a mean of 0,
    • A2 of 10 random numbers from a normal distribution with a mean of 1,
    • B1 of 10 random numbers from a normal distribution with a mean of 2,
    • B2 of 10 random numbers from a normal distribution with a mean of 3; and then
  2. Gather to create variable and value columns from columns A1 to B2; and then
  3. Separate the column variable into 2 new columns called var and var_n, separating at character 1 (see the short example after this list); and then
  4. Group by columns id and var; and then
  5. Summarise to create a new column called mean as the mean of the value column for each group; and then
  6. Spread to make new columns with the key names in var and values in mean; and then
  7. Rename to make columns called A_mean (old A) and B_mean (old B)
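
To make step 3 concrete, here is a tiny sketch of what separate() with sep = 1 does, using a made-up two-row tibble: a numeric sep splits each string at that character position.

# hypothetical mini example: "A1" splits into "A" and "1"
separate(tibble(variable = c("A1", "B2")), variable, c("var", "var_n"), sep = 1)
# returns var = c("A", "B") and var_n = c("1", "2")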

You can make intermediate objects whenever you need to break up your code because it’s getting too complicated or you need to debug something.
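
For example, you could stop the pipe after the grouped summary, inspect it, and then continue from the intermediate object (a sketch reusing data_original from the first example):

data_means <- data_original %>%
  gather(variable, value, A1:B2) %>%
  separate(variable, c("var", "var_n"), sep = 1) %>%
  group_by(id, var) %>%
  summarise(mean = mean(value))

# check data_means here, then carry on
data <- data_means %>%
  spread(var, mean) %>%
  rename(A_mean = A, B_mean = B)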

You can debug a pipe by running just the first few functions: highlight from the beginning to just before the pipe you want to stop at. Try this by highlighting from data <- to the end of the separate function and pressing cmd-return. What does data look like now?
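
As a sketch, the code that effectively gets run when you highlight up to the end of separate() is:

data <- tibble(
  id = 1:10,
  A1 = rnorm(10, 0),
  A2 = rnorm(10, 1),
  B1 = rnorm(10, 2),
  B2 = rnorm(10, 3)
) %>%
  gather(variable, value, A1:B2) %>%
  separate(variable, c("var", "var_n"), sep = 1)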

Lisa DeBruine
Professor of Psychology

Lisa DeBruine is a professor of psychology at the University of Glasgow. Her substantive research is on the social perception of faces and kinship. Her meta-science interests include team science (especially the Psychological Science Accelerator), open documentation, data simulation, web-based tools for data collection and stimulus generation, and teaching computational reproducibility.
