Introduction to R: Lecture 4

Topics: Intro to dplyr, Pipes

Sabrina Nardin, Summer 2025

Agenda

  1. Intro to dplyr and to Programming as Problem-Solving
  2. Operators
  3. Main dplyr functions
  4. Pipes (%>% or |>)

These slides were last updated on July 24, 2025

Homework 1 Feedback: Great Work!

✅ Achieved Goals

  • Got familiar with Git, GitHub, and R Markdown syntax (they take patience and repetition)
  • Shared interesting bios, links, and images… thanks!

💡 Tips for Success

  • Push all required files to GitHub
  • Commit frequently: from 5 to 20 commits per assignment
  • Embrace learning-by-doing (e.g., adding images), but post on Ed Discussion if you run into issues
  • Check our feedback on this homework: we’ll be stricter on the next ones (harder, point-based)
  • AI and Plagiarism: see Syllabus and Lecture 1 Slides

1. Intro to dplyr and to Programming as Problem-Solving

Meet the Palmer Penguins!

The palmerpenguins package includes two datasets (already installed on our Workbench):

  • penguins with clean data of 344 penguins (we use this today)
  • penguins_raw with the uncleaned version of the same data

Penguins by Allison Horst

Meet the Palmer Penguins!

What we know about each of the 344 penguins:

  • Species: Adelie, Chinstrap, Gentoo
  • Island: Biscoe, Dream, Torgersen
  • Bill length
  • Bill depth
  • Flipper length
  • Body mass
  • Sex
  • Year

Bill Measurement

Penguins Dataset Overview

# load packages
library(tidyverse)
library(palmerpenguins)

# explore data
head(penguins)


# A tibble: 8 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Torgersen           39.1          18.7               181        3750
2 Adelie    Torgersen           39.5          17.4               186        3800
3 Adelie    Torgersen           40.3          18                 195        3250
4 Adelie    Torgersen           NA            NA                  NA          NA
5 Chinstrap Dream               43.5          18.1               202        3400
6 Chinstrap Dream               49.6          18.2               193        3775
7 Chinstrap Dream               50.8          19                 210        4100
8 Chinstrap Dream               50.2          18.7               198        3775
# ℹ 2 more variables: sex <fct>, year <int>


Scatterplot: Flipper Length vs. Body Mass

Start from what we know: scatter plot with two numeric variables

Copy/paste this code in R (note we use ggplot2 with defaults to keep the code short, color by species, add a title):

library(tidyverse)
library(palmerpenguins)
head(penguins)

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(title = "Penguins: Body Mass vs. Flipper Length")


💡 How does the relationship between flipper length and body mass differ across species?


Bar Plot: Number of Penguins by Species

Try a different plot: bar plot with counts for one categorical variable

Copy/paste this code in R:

ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(title = "Count of Penguins by Species")


💡 What does this bar plot tell us about penguin species frequency?


Make, Store, Save a Plot

Only make a plot

ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(title = "Count of Penguins by Species")

Make a plot and store it to an object

species_count <- ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(title = "Count of Penguins by Species")
species_count

Make a plot, store it, and save it in your current working directory

species_count <- ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(title = "Count of Penguins by Species")
species_count
ggsave("penguins-species-count.png", plot = species_count)
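
A minimal sketch of the same make, store, save workflow with an explicit size. The temporary folder and the width, height, and dpi values are assumptions chosen for illustration; in class you would normally save to your working directory.

```r
library(ggplot2)
library(palmerpenguins)

# Make the plot and store it in an object
species_count <- ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(title = "Count of Penguins by Species")

# Hypothetical output location: a temporary folder instead of the working directory
out_file <- file.path(tempdir(), "penguins-species-count.png")

# width and height are in inches by default; dpi controls the resolution
ggsave(out_file, plot = species_count, width = 6, height = 4, dpi = 300)

# Check that the file was written
file.exists(out_file)
```

Passing plot = species_count explicitly avoids relying on "the last plot displayed", which is what ggsave() uses when no plot is given.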

Data Manipulation with dplyr

Today we introduce a second package from the tidyverse: dplyr for data manipulation.

  • Designed for manipulating data frames and tibbles
  • Includes intuitive, clearly named functions for common tasks, like filter() to filter rows based on conditions, summarize() to calculate summaries (e.g., averages), group_by() for grouped operations, etc.

Let’s work through two questions that require us to use these three dplyr functions!

Programming is Problem-Solving

Penguins Dataset:

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>


💡 Q1: What is the average body mass of an Adelie penguin?

💡 Q2: What is the average body mass for each penguin species (three species)?

Do not write code! Think about the logical/conceptual steps you’d give R. We’ll translate them into code together.


Q1: What is the average body mass of an Adelie penguin?

Penguins Dataset:

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>


Instructions to answer the first question:

  1. Identify the data and variables you need
  2. Filter only the observations (rows) where species is Adelie
  3. Calculate the mean of the variable body_mass_g for this group

Open R: let’s turn these steps into code using dplyr
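
One possible translation of the three steps, written without pipes and with one object per step. The object names adelie and adelie_mass are illustrative choices, not required names.

```r
library(dplyr)
library(palmerpenguins)

# Step 1: the data is penguins; the variables are species and body_mass_g

# Step 2: keep only the rows where species is Adelie
adelie <- filter(penguins, species == "Adelie")

# Step 3: average body mass; na.rm = TRUE drops missing values before averaging
adelie_mass <- summarize(adelie, avg_mass = mean(body_mass_g, na.rm = TRUE))
adelie_mass
```

Without na.rm = TRUE the result would be NA, because at least one Adelie penguin has a missing body mass.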


Q2: What is the average body mass for each penguin species?

Penguins Dataset:

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>


Instructions to answer the second question:

  1. Identify the data and variables you need
  2. Group the observations (rows) by species
  3. Calculate the mean of the variable body_mass_g for all groups

Open R: let’s turn these steps into code using dplyr
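
A sketch of the second set of steps, again one object per step. The names by_species and species_mass are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Step 2: group the rows by species (the data itself is unchanged)
by_species <- group_by(penguins, species)

# Step 3: summarize() now computes one mean per group, so we get one row per species
species_mass <- summarize(by_species, avg_mass = mean(body_mass_g, na.rm = TRUE))
species_mass
```

The only difference from Q1 is that filter() is replaced by group_by(): instead of keeping one species, we keep all of them and compute the mean within each.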


2. Operators

Assignment Operators

x <- 5                # assign 5 to an object
mean(x = c(1, 2, 3))  # use = to specify an argument inside a function


penguins_species <- group_by(.data = penguins, species)


Logical Operators

x == y    # is equal (TRUE or FALSE)
x != y    # is not equal (TRUE or FALSE)
x < y     # less than
x <= y    # less than or equal to
y > x     # greater than
y >= x    # greater than or equal to


adelie <- filter(penguins, species == "Adelie")


no_adelie <- filter(penguins, species != "Adelie")


heavy <- filter(penguins, body_mass_g > 4500)
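
To see why these operators work inside filter(), it can help to look at what a comparison returns on its own. The vector x below is a made-up example; the second part uses the penguins data.

```r
library(palmerpenguins)

# A comparison returns one TRUE/FALSE per element
x <- c(3, 5, 8)
x > 4     # FALSE  TRUE  TRUE
x == 5    # FALSE  TRUE FALSE

# The same idea on a column: one TRUE/FALSE per penguin
is_adelie <- penguins$species == "Adelie"
head(is_adelie)

# Summing a logical vector counts the TRUE values
sum(is_adelie)
```

filter() evaluates the condition for every row in the same way and keeps the rows where the result is TRUE.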


More Logical Operators

x | y     # EITHER x OR y has to be true
x & y     # BOTH x AND y have to be true
x & !y    # x AND NOT y (x is true AND y is false)


Example use of & operator. What does this code return?

filter(.data = penguins, species == "Adelie" & species == "Chinstrap")


Example use of | operator. What does this code return?

filter(.data = penguins, species == "Adelie" | species == "Chinstrap")
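
One way to check your answer to both questions is to count the rows each call returns. The object names both and either are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# AND: a single penguin would need to be Adelie and Chinstrap at the same time
both <- filter(.data = penguins, species == "Adelie" & species == "Chinstrap")
nrow(both)    # 0

# OR: a penguin only needs to match one of the two conditions
either <- filter(.data = penguins, species == "Adelie" | species == "Chinstrap")
nrow(either)
```

The & version is valid code and runs without error; it simply returns an empty tibble, which is a common source of confusion.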


The OR operator can be used with long or short syntax


Example use of | operator with extended syntax:

penguins_adelie_chin <- filter(.data = penguins, 
                                species == "Adelie" | species == "Chinstrap")


Same code with shorter syntax:

penguins_adelie_chin <- filter(.data = penguins, 
                                species %in% c("Adelie", "Chinstrap"))
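
A small check that the two versions select the same rows, plus %in% on a made-up vector to show what it returns. The names long_version and short_version are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# %in% asks, for each element on the left, whether it appears in the set on the right
c("a", "b", "z") %in% c("a", "b")    # TRUE  TRUE FALSE

long_version <- filter(.data = penguins,
                       species == "Adelie" | species == "Chinstrap")
short_version <- filter(.data = penguins,
                        species %in% c("Adelie", "Chinstrap"))

# Compare the two results
all.equal(as.data.frame(long_version), as.data.frame(short_version))
```

The shorter syntax pays off when the set of values grows: adding a third value means adding one element to the vector rather than another | condition.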


💻 Practice: Logical Operators with filter()

Logical operators are often used together with the filter() function from dplyr

Practice using them with the penguins dataset:

  • Task 1: Get all Adelie penguins with flipper length greater than or equal to 180 mm

  • Task 2: Get all penguins on Dream and Torgersen islands that are not female

Share your code here: https://codeshare.io/5zlNLE

3. Main dplyr functions

Recap of What We Learned So Far

Conceptually, any data transformation using dplyr requires us to:

  1. Identify the data frame and the variables we need

  2. Use dplyr functions to tell R what action to take on which variable(s). These functions:

    • Act like verbs in a sentence: they express what to do with the data
    • Can be combined to perform complex operations
  3. Save the result, usually into a new object (a new dataframe)

Key dplyr Functions

The package dplyr has many functions, but you don’t need to memorize them all!

OUR GOALS: Memorize the key functions + Know where to look up the rest: dplyr.tidyverse.org

The next slide summarizes the most common dplyr functions to memorize.

Key dplyr Functions

function()     What it does
filter()       Selects rows based on values in one or more columns
arrange()      Reorders rows based on the values in specified columns
select()       Chooses specific columns by name
rename()       Renames one or more columns
mutate()       Adds new columns or modifies existing ones
group_by()     Groups the data by one or more variables for grouped operations
summarize()    Reduces each group to a single row using summary statistics (e.g., mean, sum, n)
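
A short sketch of the four functions in the table that the earlier examples did not use. The object names and the new column names (mass_g, mass_kg) are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# arrange(): reorder rows, heaviest penguins first (missing values go last)
heaviest_first <- arrange(penguins, desc(body_mass_g))

# select(): keep only some columns
small <- select(penguins, species, island, body_mass_g)

# rename(): the syntax is new_name = old_name
renamed <- rename(small, mass_g = body_mass_g)

# mutate(): add a column computed from an existing one
with_kg <- mutate(renamed, mass_kg = mass_g / 1000)

head(with_kg)
```

Each function takes a data frame as its first argument and returns a new data frame, which is what later makes them easy to chain.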


Tip for Remembering These Functions

Each row is an observation (e.g., one penguin) and each column is a variable (e.g., species, body mass). Some functions work on rows, like filter() and arrange(); others work on columns, like select() and mutate(). Think before coding!


Unpacking group_by()

This function tells R to temporarily group the data by one or more variables, so the next function runs within each group. For example, this code groups the data by species, so whatever runs next (here summarize()) happens for each species separately:

grouped <- group_by(penguins, species)
summarize(grouped, avg_mass = mean(body_mass_g, na.rm = TRUE))

What group_by() Does

group_by() doesn’t change the values in your data: it changes how the next function treats the data. It’s usually used right before summarize(), filter(), or mutate() to make those functions run once per group, not across the whole dataset. Note that arrange() ignores groups unless you set .by_group = TRUE.
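
A minimal sketch of what grouping does and does not change. The column name mass_vs_species is a made-up name for illustration.

```r
library(dplyr)
library(palmerpenguins)

grouped <- group_by(penguins, species)

# Same rows and columns as before; only the grouping information is new
dim(grouped)
group_vars(grouped)

# A grouped mutate(): each penguin is compared with the mean of its own species
centered <- mutate(grouped,
                   mass_vs_species = body_mass_g - mean(body_mass_g, na.rm = TRUE))

# ungroup() removes the grouping once it is no longer needed
centered <- ungroup(centered)
is_grouped_df(centered)
```

Without group_by(), the same mutate() call would subtract the overall mean across all penguins instead of the species mean.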

💻 Practice

What is the average body mass for Adelie penguins by sex?


First, THINK: How would you approach this question conceptually? Break it down into clear and simple steps before coding.

Then, CODE: Translate those steps into R using the appropriate dplyr functions.

Hint: You’ll need to use three dplyr functions, in the right order

Share your code here: https://codeshare.io/5zlNLE


🧠 Filter or Group First?

Task: Calculate the average body mass for Adelie penguins by sex.


  • ✅ Filter for Adelie, then group by sex, then summarize
  • Group by sex, then filter for Adelie, then summarize
  • Group by species, filter for Adelie, then group by sex, then summarize
  • Filter by both species and sex, then group, then summarize
  • Group by both species and sex, then summarize, then filter

The first is the best approach in this case. The other approaches are all correct (try them out!), but are less readable or do more work than needed.

The next slide compares the first two, which are the most common.


🧠 Filter or Group First?

Task: Calculate the average body mass for Adelie penguins by sex.


Filter → Group (Best Practice here)

  • Filter first to keep only Adelie penguins
    → reduces rows right away
  • Group the filtered data by sex
    → only relevant data is grouped
  • Summarize to get average

Group → Filter (Works but Not Ideal here)

  • Group all penguins by sex
    → includes extra, unneeded data
  • Filter to keep only Adelie penguins
    → discards part of what was grouped
  • Summarize to get average
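
The two orders can be compared directly. This is a sketch with one object per step; all object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Filter -> Group -> Summarize
adelie <- filter(penguins, species == "Adelie")
adelie_by_sex <- group_by(adelie, sex)
filter_first <- summarize(adelie_by_sex,
                          avg_mass = mean(body_mass_g, na.rm = TRUE))

# Group -> Filter -> Summarize
by_sex <- group_by(penguins, sex)
by_sex_adelie <- filter(by_sex, species == "Adelie")
group_first <- summarize(by_sex_adelie,
                         avg_mass = mean(body_mass_g, na.rm = TRUE))

# Both orders give the same averages
all.equal(filter_first$avg_mass, group_first$avg_mass)
```

Both results include a row where sex is NA, because some Adelie penguins have no recorded sex.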


🧠 Filter or Group First?

The best order depends on the task (think first!) but here’s a rule of thumb:


Filter first when you can reduce the data before grouping. Example: Calculate the average body mass for Adelie penguins by sex.

adelie <- filter(penguins, species == "Adelie")
adelie_by_sex <- group_by(adelie, sex)
result <- summarize(adelie_by_sex, 
                    avg_mass = mean(body_mass_g, na.rm = TRUE))


Group first only when your filter depends on group-level summaries. Example: Calculate the average body mass only for species whose average bill length is over 40 mm.

by_species <- group_by(penguins, species)
avg_bill_high <- filter(by_species, mean(bill_length_mm, na.rm = TRUE) > 40)
result <- summarize(avg_bill_high, 
                    avg_mass = mean(body_mass_g, na.rm = TRUE))
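
To see which species pass the grouped filter, one option is to compute the group-level summary on its own first. The names bill_means and kept are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Average bill length per species
by_species <- group_by(penguins, species)
bill_means <- summarize(by_species,
                        avg_bill = mean(bill_length_mm, na.rm = TRUE))
bill_means

# Species whose average is over 40 mm: these are the groups the filter keeps
kept <- filter(bill_means, avg_bill > 40)
kept$species    # Chinstrap and Gentoo
```

Inside a grouped filter(), mean() is computed once per group, so each species is kept or dropped as a whole rather than row by row.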

4. Pipes %>% or |>

Pipes to Chain Commands

Pipes allow you to write a sequence of operations by passing the result of one function into the next, making your code more readable and logical. Compare these two versions of the same code to calculate the average body mass for Adelie penguins by island.

Without pipes:

adelie <- filter(penguins, species == "Adelie")
adelie_island <- group_by(adelie, island)
adelie_avg_mass_island <- summarize(adelie_island, body_mass = mean(body_mass_g, na.rm = TRUE))

With pipes (%>% or |>):

adelie_avg_mass_island <- penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
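
A sketch of what the pipe does behind the scenes, and of the same chain written with the native pipe. The native pipe |> assumes R 4.1.0 or later; object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# x %>% f(y) is another way of writing f(x, y)
without_pipe <- filter(penguins, species == "Adelie")
with_pipe <- penguins %>% filter(species == "Adelie")
all.equal(as.data.frame(without_pipe), as.data.frame(with_pipe))

# The native pipe |> reads the same way
adelie_avg_mass_island <- penguins |>
  filter(species == "Adelie") |>
  group_by(island) |>
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
adelie_avg_mass_island
```

For simple chains like this one the two pipes are interchangeable; %>% comes from the magrittr package (loaded with the tidyverse), while |> is built into R.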

Multiple Ways to Write R Code — Pipes Are Often the Best Choice

Pipes are great and our ultimate goal. But there are several ways to write the same R code.

In fact, R didn’t have pipes for a long time!

Let’s compare different ways to write the same code…

Four Different Options to Code This Task

Task: Calculate the average body mass for Adelie penguins by island.


Strategy: Break Down the Task Before You Code it!

  • Identify the data and variables needed
  • Filter the data for rows where species is Adelie
  • Group the filtered data by island
  • Calculate the average body mass for each group


Option 1: Save each step in a new data frame

penguins_adelie <- filter(penguins, species == "Adelie")
penguins_adelie_island <- group_by(penguins_adelie, island)
penguins_final <- summarize(penguins_adelie_island, 
                            body_mass = mean(body_mass_g, na.rm = TRUE))
print(penguins_final)
# A tibble: 3 × 2
  island    body_mass
  <fct>         <dbl>
1 Biscoe        3710.
2 Dream         3688.
3 Torgersen     3706.


✅ This is valid code.

⚠️ Drawback: You must save each intermediate object. This can clutter your environment, increase R memory usage with large datasets, and make your code more prone to typos. Shorter names for each step might reduce typos but sacrifice clarity, which is not good for self-documentation.


Option 2: Replace the original data frame

penguins <- filter(penguins, species == "Adelie")
penguins <- group_by(penguins, island)
penguins <- summarize(penguins, body_mass = mean(body_mass_g, na.rm = TRUE))
print(penguins)
# A tibble: 3 × 2
  island    body_mass
  <fct>         <dbl>
1 Biscoe        3710.
2 Dream         3688.
3 Torgersen     3706.


✅ This also works, but it’s not good practice.

⚠️ Drawback: It overwrites the original dataset. If something goes wrong midway, you’ll need to re-run everything from scratch.

Warning

This approach is risky — especially when working with important datasets. Always keep a copy of your original data before modifying it.
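
If you do overwrite penguins by accident, a sketch of one way to recover. This relies on the fact that the assignment creates a new object in your workspace and leaves the package copy untouched.

```r
library(dplyr)
library(palmerpenguins)

# Overwriting creates a new object called penguins in your workspace
penguins <- filter(penguins, species == "Adelie")
nrow(penguins)

# The package copy is still complete; pkg::name reaches it directly
nrow(palmerpenguins::penguins)

# Remove the overwritten copy so that penguins refers to the package data again
rm(penguins)
nrow(penguins)
```

This only works for datasets that ship with a package. For data you read from a file, you would need to read the file again, which is why keeping the original object intact is the safer habit.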


Option 3: Function composition

data(penguins)
summarize(group_by(filter(penguins, species == "Adelie"), island), 
          body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  island    body_mass
  <fct>         <dbl>
1 Biscoe        3710.
2 Dream         3688.
3 Torgersen     3706.


✅ This also works, and some people like this style.

⚠️ Drawback: It’s harder to read and debug. You must read it from the inside out, and if something breaks, it’s difficult to isolate where the error is happening because you can’t easily inspect intermediate results.


Option 4: Pipes (The Winner!)

penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  island    body_mass
  <fct>         <dbl>
1 Biscoe        3710.
2 Dream         3688.
3 Torgersen     3706.


✅ This is valid and readable code — without the drawbacks of the previous options.

💡 Why pipes? The pipe operator (written as %>% or |>) passes the result of one function to the next, making your code easy to read from top to bottom. Pipes emphasize actions, not object names, and you can read the code like a recipe:

  • Start with the dataset
  • Filter for Adelie penguins
  • Group by island
  • Summarize body mass


Common Errors with Pipes: Examples using flights data

We use the flights dataset from the nycflights13 package: all flights (n = 336,776) that departed from NYC in 2013.

library(nycflights13)
data(flights)
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Common Errors with Pipes: Examples using flights data

Note the use of glimpse() vs head() to explore the dataset. In this case, glimpse() is more useful. Why?

library(nycflights13)
data(flights)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Common Pipe Errors: Example 1

What’s wrong with this code?

Before we debug it, let’s first think through what the code is trying to do conceptually.

Invalid code

delays <- flights %>% 
  by_dest <- group_by(dest) %>% 
  delay <- summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) %>% 
  d <- filter(count > 20)

  • Take the flights dataset
  • Group flights by destination, using the variable dest
  • Count the flights to each destination and store the result in a new variable called count
  • Calculate the average arrival delay, using arr_delay, and store the result in a new variable called delay
  • Keep only destinations with more than 20 flights
  • Question: Why does filter(count > 20) remove destinations with 20 or fewer flights?
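
A hint for the question above, using a made-up toy table rather than the real flights data. The destination codes and counts are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: three destinations with made-up flight counts
toy <- tibble(dest = c("AAA", "BBB", "CCC"), count = c(5, 20, 300))

# filter() keeps the rows where the condition is TRUE and drops the rest
kept <- filter(toy, count > 20)
kept$dest    # "CCC"
```

The condition describes what to keep, not what to remove, so "remove the small destinations" is written as "keep the large ones". Note that the destination with exactly 20 flights is dropped too, because 20 > 20 is FALSE.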

Correct code

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20)


Important

Don’t assign anything inside a pipe. Use <- only at the start, if you want to save the final result. Do not use it between steps.


Common Pipe Errors: Example 2

What’s wrong with this code?

Invalid code

delays <- flights %>%
  group_by(dest)
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE))
  filter(count > 20)

Correct code

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20)


Important

Each function in a pipe chain must be connected with %>% to keep the chain going.
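
A sketch of what happens when the pipe is missing, using the penguins data so it runs without nycflights13. The tryCatch() wrapper and the message text are additions for illustration; in class you would simply see the error in the console.

```r
library(dplyr)
library(palmerpenguins)

# Without %>% after group_by(), the chain stops there:
# the object only contains the grouped data
by_species <- penguins %>%
  group_by(species)

# The next call then runs on its own, with no data passed in, and fails
msg <- tryCatch(
  summarize(count = n()),
  error = function(e) "summarize() failed: no data frame was piped in"
)
msg
```

R treats each line that is not connected by a pipe as a separate command, which is why the error appears on the summarize() line and not on the assignment above it.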


Common Pipe Errors: Example 3

What’s wrong with this code?

Invalid code

delays <- flights %>% 
  group_by(.data = flights, dest) %>% 
  summarize(.data = flights,
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) %>% 
  filter(.data = flights, count > 20)

Correct code

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20)


Important

When using pipes, only reference the data frame at the start of the chain. Do not repeat .data = flights in every function, as %>% automatically passes the data along.


Common Pipe Errors: Example 4

What’s wrong with this code?

Invalid code

delays <- flights +
  group_by(dest) +
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) +
  filter(count > 20)

Correct code

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20)


Important

The + sign is only for adding layers in ggplot2! Don’t use it to chain dplyr functions: use %>% to pipe data through a sequence of transformations.
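
The two operators can appear in the same chain, as long as each is used for its own job. A sketch with the penguins data; the plot title and the object name adelie_plot are illustrative.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

adelie_plot <- penguins %>%
  filter(species == "Adelie") %>%                          # dplyr step: use %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +    # from ggplot() on: use +
  geom_point() +
  labs(title = "Adelie Penguins: Body Mass vs. Flipper Length")

adelie_plot
```

The switch happens at ggplot(): everything before it transforms the data, everything after it adds layers to the plot.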


💻 Practice

Download today’s in-class exercises from the website for more practice on operators, dplyr, and pipes.

Recap: What We Learned Today

  • Practiced programming as problem-solving: think through tasks before coding
  • Learned R operators
  • Key dplyr functions like filter(), group_by(), summarize(), and more
  • Chained commands with pipes
  • Reviewed common mistakes when using pipes and how to avoid them
