Introduction to R: Lecture 6

Topics: Factors in R, Deep Dive into dplyr

Sabrina Nardin, Summer 2025

Agenda

Define and use Factors in R
Review and expand our knowledge of dplyr “verbs” for data manipulation

These slides were last updated on July 24, 2025

1. Factors in R

What are Factors in R?

Categorical variables are variables with a fixed set of possible values. For example, the species variable in the penguins dataset can only take on one of three values: “Adelie”, “Chinstrap”, or “Gentoo”.

By default, R uses character vectors to store categorical variables. But character vectors don’t preserve meaninfgul order.

To fix this, R uses factors — a special type of vector designed for categorical data.

What are Factors in R?

Character Vector vs. Factor

In R, the most common data structures to store categorical variables are:

Character vector (default): Data type for storing categorical data (or general text) as plain strings. Values have no built-in order or grouping.
Factor (preferred when order matters): Data type for storing categorical data with defined levels. Values are treated as discrete categories, with optional ordering.

Real-world examples of categorical data

You often need factors in R. For example, factors allow sorting categorical variables in your desired order, such as:

Months of the year
Likert scales (e.g., Strongly Agree → Strongly Disagree)
Political parties
Educational attainment levels
Race/ethnicity categories
Etc.

Why use Factors?

Define a character vector with four months and sort it. Copy and paste this code in R and run it. What do you notice?

# Define
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Check
x1
class(x1)

# Sort
sort(x1)

Why use Factors?

From the previous example we observe that by default, R sorts character vectors alphabetically!

However, alphabetical order isn’t ideal to sort months — we usually want them in chronological order. To do that in R, we need to convert them to factors.

In the next slides, we learn two common cases that you’ll likely encounter when working with factors:

Converting a Character Vector to a Factor
Converting a Numeric Vector to a Factor

1. Converting a Character Vector to a Factor

Character vectors sort alphabetically by default. To change the order, convert them to factors using factor() and assign the desired levels.

# Character vector with month names
x1 <- c("Dec", "Apr", "Jan", "Mar")
class(x1)

# Define all possibile levels in desired order
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Convert to factor using those levels
y1 <- factor(x1, levels = month_levels)

# Check
class(y1)
levels(y1)

# Compare sorting
sort(x1)  # Alphabetical
sort(y1)  # Chronological (by factor levels)

2. Converting a Numeric Vector to a Factor

Sometimes categorical data is stored as numbers (e.g., months as 1, 2, 12) in numeric vectors. To convert them to factors with factor(), you need to specify both levels and labels.

# Numeric vector where values represent months
x2 <- c(12, 4, 1, 3)
class(x2)

# Define all possibile numeric values we expect (1 = Jan, ..., 12 = Dec)
month_levels <- 1:12

# Define all labels we want to show for each value
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") 

# Convert to factor using levels and labels
y2 <- factor(x2, 
            levels = month_levels,
            labels = month_labels)

# Check
class(y2)
levels(y2)

# Compare sorting
sort(x2)  # Numeric
sort(y2)  # Chronological (by factor levels)

Levels and Labels in Factors

Levels

Define the set of distinct categories a factor can take
Use to tell R the order of theh categories
Use only levels when your input data is already readable (e.g., "Jan", "Feb", …, "Dec")

Labels

Optional names to display when the data uses codes
Use to tell R what to display for each value
Use labels with levels when your input data uses codes (e.g., 1 = "Jan"). Labels are matched to levels, not to raw values

Tip

Use both levels and labels in the factor() function when you want to map specific underlying values (levels) to more human-readable names (labels). The most common use is when your input data is a numeric vector that uses codes (e.g., uses the number 1 for January).

Levels and Labels in Factors

When you input is a Character Vector

Example values: "Jan", "May", "Oct"

Already readable by humans
(Often) only want to control the order
Use only levels
Example: "May" stays "May", but will now sort correctly

When you input is a Numeric Vector

Example values: 1, 7, 12

Not already readable by humans
Want to control the order & add readable labels
Use levels to order & labels to define what to show
Example: 1 becomes "Jan", 2 becomes "Feb", etc.

Levels and Labels in Factors: Examples

Before you run the code below in R, take a moment to predict the output. What do you expect each code chunk to return, and why?

# Example 1 with Input as Character Vector
a <- c("Low", "High", "Medium", "Low")
b <- factor(a, levels = c("Low", "Medium", "High"))
b

# Example 2 with Input as Numeric vector
i <- c(1, 3, 2, 1)
j <- factor(i, levels = c(1, 2, 3),
               labels = c("Low", "Medium", "High"))
j

Common Error: Mismatched Labels and Levels

Try the code below in R. What do you notice?

# Numeric vector representing months
x2 <- c(12, 4, 1, 3)
class(x2)

# Attempt 1
y2 <- factor(x2, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))

# Attempt 2
y2 <- factor(x2, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), 
                 labels = c("Jan","Mar", "Apr", "Dec"))

# Attempt 3
y2 <- factor(x2, levels = c(1, 2, 3, 4),
                 labels = c("Jan", "Mar", "Apr", "Dec"))

How to Fix It

None of the three attempts from the previous slide works.

Attempts 1 and 2: have a mismatch between the number of labels and the levels. R throws an error.
Attempt 3: R doesn’t throw an error, but the code is incorrect because it forces levels 1 to 4, even though the input vector has values like 12. Since 12 isn’t among the defined levels, it becomes NA, leading to incorrect matches.

You can fix the code from the previous slide in two ways:

# Numeric vector representing months
x2 <- c(12, 4, 1, 3)

# Correct code option 1 (reccomended)
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "June", "July", "Aug", "Sept", "Oct", "Nov", "Dec")
y2 <- factor(x2, levels = 1:12,  
                 labels = month_labels)

# Correct code option 2 (works but keeps only used months)
y2 <- factor(x2, levels = c(1, 3, 4, 12),
                 labels = c("Jan", "Mar", "Apr", "Dec"))

The “forcats” Package

The function factor() is the “base R” way to create and manage factors. It’s a foundational tool, and it’s important to learn how it works!

But once you’re comfortable with it, the forcats package (part of the tidyverse) offers cleaner, more powerful tools for working with categorical data.

The “forcats” Package

The forcats package has several functions to work with factors. Below are three commonly used ones:

Function	What It Does	When to Use
`fct_relevel()`	Manually set the order of levels	Similar to `factor(..., levels = ...)` in base R
`fct_reorder()`	Reorder levels based on another variable (e.g., numeric)	Easier to use, great for ordering bars in `ggplot2`
`fct_infreq()`	Reorder levels by frequency (most to least common)	Useful when showing most common categories first

For more functions, see the forcats documentation

💻 Practice: Plotting Tips by Weekday

In this exercise, you’ll learn two things:

to correctly use stat = identity with bar plots
to control the order of categories in bar plots, using both factor() and forcats function called fct_relevel()

Load Data

Copy and run the code below to create this dataset:

library(tidyverse)

df <- tibble(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)

df

Our Goal: Create a bar plot with days of the week on the x-axis and the total tip amount on the y-axis (e.g., Saturday should display a bar with a height of $55, etc.).

What’s Wrong With This Plot?

Try this code. What does the height of each bar represent?

ggplot(df, aes(x = week)) +
  geom_bar()

Warning

Why aren’t the bars showing the actual tip amounts?

Because by default, geom_bar() uses stat = "count" to counts row and plot them on the y-axis
To plot the actual values (not counts), use stat = "identity" with both a x and y

Fix the y-axis on this bar plot

To fix this plot, we need to change the default `stat in geom_bar() (from count to identity) and manually specify both the x and y aesthetics. How do we know this? From the documentation or by typing ?geom_bar in the R Console.

ggplot(df, aes(x = week, y = tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")

This is much better, but we still do not have the bars nicely ordered….

Fix the order of the bars using factor()

We use base R’s factor() to control the order of weekdays.

Fill in the correct weekday order in the code below:

df %>%
  mutate(week = factor(week, levels = _____ )) %>%   # fill in the order
  ggplot(aes(x = week, y = tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")

Do the same but using fct_relevel()

Now we make the same plot using fct_relevel()

Fill the correct weekday order in the code below:

df %>%
  mutate(week = fct_relevel(week, ______ )) %>%   # your order here
  ggplot(aes(x = week, y = tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")

Solutions

# With factor
days <- c("Mon", "Wed", "Thu", "Fri", "Sat")
df %>%
  mutate(week = factor(week, levels = days)) %>%
  ggplot(aes(x = week, y = tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")

# With fct_relevel
days <- c("Mon", "Wed", "Thu", "Fri", "Sat")
df %>%
  mutate(week = fct_relevel(week, days)) %>%
  ggplot(aes(x = week, y = tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")

Q: Why don’t we use labels in factor(week, levels = days)? Because the values ("Mon", etc.) are already readable. You don’t need to change them, unless you want different names (e.g., "Monday", etc.).

Q: Why doesn’t days include all 7 days of the week? We could include all possible levels, and it’s good practice for consistency when using factor(). But other functions, like fct_relevel(), may not add missing levels if those values aren’t present in the input data.

Reflections

Tip

What happens when you don’t set the factor order?
Which method (factor() or fct_relevel()) do you prefer and why? Check the forcats package documentation for more functions, especially fct_reorder() which is straightforward to use
Why is stat = "identity" necessary for this kind of plot?
Always match the number of provided levels and labels inside `factor()``
Tip: HW2 includes a question with an optional part that asks you to do exactly this (see also Lecture 5 in-class materials for further help on this)

Want more practice? Download today’s in-class materials for more practice exercises on working with factors!

2. Deep dive into dplyr functions for data manipulation

Key dplyr Functions (from Lecture 4 Slides)

`function()`	What it does
`filter()`	Selects rows based on values in one or more columns
`arrange()`	Reorders rows based on the values in specified columns
`select()`	Chooses specific columns by name
`rename()`	Renames one or more columns
`mutate()`	Adds new columns or modifies existing ones
`group_by()`	Groups the data by one or more variables for grouped operations
`summarize()`	Reduces each group to a single row using summary statistics (e.g., mean, sum, n)

Tip for Remembering These Functions

Each row is an observation (e.g., one penguin) and each column is a variable (e.g., species, body mass). Some functions works on rows like filter(), arrange() others on columns like select(), mutate(). Think before coding!

More Functions

`function()`	What it does
`relocate()`	Reorders columns by name; works on columns, not rows like `arrange()`
`count()`	Counts observations by group
`n_distinct()`	Counts the number of unique values in a column; often used with `summarize()`
`distinct()`	Returns unique rows based on one or more columns
`across()`	Applies the same operation to multiple columns at once

💻 Practice

We review these verbs using the Rmd tutorial downloadable from today’s in-class materials on the course website

Recap: What We Learned Today

What factors are and how to convert character or numeric vectors to factors
How to control the order of categories using levels and labels
How to fix common geom_bar() plotting issues (e.g., bar heights and order)
Reviewed and expanded dplyr functions for data manipulation

To print these slides as pdf

Click on the icon bottom-right corner > Tools > PDF Export Mode > Print as a Pdf