Topics: Factors in R, Deep Dive into dplyr “verbs” for data manipulation
These slides were last updated on July 24, 2025
Categorical variables are variables with a fixed set of possible values. For example, the species
variable in the penguins dataset can only take on one of three values: “Adelie”, “Chinstrap”, or “Gentoo”.
By default, R uses character vectors to store categorical variables. But character vectors don’t preserve meaningful order.
To fix this, R uses factors — a special type of vector designed for categorical data.
Character Vector vs. Factor
In R, the most common data structures to store categorical variables are:
Character vector (default): Data type for storing categorical data (or general text) as plain strings. Values have no built-in order or grouping.
Factor (preferred when order matters): Data type for storing categorical data with defined levels. Values are treated as discrete categories, with optional ordering.
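To make the contrast concrete, here is a minimal sketch using a made-up size variable (the names sizes_chr and sizes_fct are illustrative and do not come from the slides). It shows that a factor carries a fixed set of levels, while a character vector does not.

```r
# Illustrative data (hypothetical, not from the penguins dataset)
sizes_chr <- c("medium", "small", "large", "small")

# A character vector has no levels
levels(sizes_chr)          # NULL

# A factor stores a fixed set of levels, here in a chosen order
sizes_fct <- factor(sizes_chr, levels = c("small", "medium", "large"))
levels(sizes_fct)          # "small" "medium" "large"

# Internally, each value is an integer code pointing at a level
as.integer(sizes_fct)      # 2 1 3 1
```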
You often need factors in R. For example, factors let you sort a categorical variable in the order you want, such as months in chronological order.
Define a character vector with four months and sort it. Copy and paste this code in R and run it. What do you notice?
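The code chunk for this slide is not reproduced here. A minimal sketch of what it might look like, assuming the same four months used later in these slides (the variable names are illustrative):

```r
# Character vector with four month names
months_chr <- c("Dec", "Apr", "Jan", "Mar")

# Sort the character vector and look at the order of the result
sorted_months <- sort(months_chr)
sorted_months
```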
From the previous example we observe that by default, R sorts character vectors alphabetically!
However, alphabetical order isn’t ideal to sort months — we usually want them in chronological order. To do that in R, we need to convert them to factors.
In the next slides, we learn two common cases that you’ll likely encounter when working with factors:
Character vectors sort alphabetically by default. To change the order, convert them to factors using factor() and assign the desired levels.
# Character vector with month names
x1 <- c("Dec", "Apr", "Jan", "Mar")
class(x1)
# Define all possible levels in desired order
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Convert to factor using those levels
y1 <- factor(x1, levels = month_levels)
# Check
class(y1)
levels(y1)
# Compare sorting
sort(x1) # Alphabetical
sort(y1) # Chronological (by factor levels)
Sometimes categorical data is stored as numbers (e.g., months as 1, 2, 12) in numeric vectors. To convert them to factors with factor(), you need to specify both levels and labels.
# Numeric vector where values represent months
x2 <- c(12, 4, 1, 3)
class(x2)
# Define all possible numeric values we expect (1 = Jan, ..., 12 = Dec)
month_levels <- 1:12
# Define all labels we want to show for each value
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Convert to factor using levels and labels
y2 <- factor(x2,
levels = month_levels,
labels = month_labels)
# Check
class(y2)
levels(y2)
# Compare sorting
sort(x2) # Numeric
sort(y2) # Chronological (by factor levels)
Use levels when your input data is already readable (e.g., "Jan", "Feb", …, "Dec").
Use labels with levels when your input data uses codes (e.g., 1 = "Jan"). Labels are matched to levels, not to raw values.
Tip
Use both levels and labels in the factor() function when you want to map specific underlying values (levels) to more human-readable names (labels). The most common use is when your input data is a numeric vector that uses codes (e.g., uses the number 1 for January).
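A small sketch of what "labels are matched to levels" means in practice. The survey codes below are made up for illustration and are not part of the original slides. Each label is paired with the level in the same position, so changing the order of levels and labels together changes the level order but not the mapping.

```r
# Hypothetical survey codes: 1 = "No", 2 = "Yes", 3 = "Unsure"
codes <- c(2, 1, 2, 3)

answers <- factor(codes,
                  levels = c(1, 2, 3),
                  labels = c("No", "Yes", "Unsure"))
as.character(answers)       # "Yes" "No" "Yes" "Unsure"

# Labels pair with levels by position, so reordering both together
# keeps the same mapping but changes the order of the levels
answers_rev <- factor(codes,
                      levels = c(3, 2, 1),
                      labels = c("Unsure", "Yes", "No"))
as.character(answers_rev)   # "Yes" "No" "Yes" "Unsure"
levels(answers_rev)         # "Unsure" "Yes" "No"
```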
Example values: "Jan", "May", "Oct"
Use levels
"May" stays "May", but will now sort correctly
Example values: 1, 7, 12
Use levels to order & labels to define what to show
1 becomes "Jan", 2 becomes "Feb", etc.
Before you run the code below in R, take a moment to predict the output. What do you expect each code chunk to return, and why?
Try the code below in R. What do you notice?
# Numeric vector representing months
x2 <- c(12, 4, 1, 3)
class(x2)
# Attempt 1
y2 <- factor(x2, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))
# Attempt 2
y2 <- factor(x2, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
labels = c("Jan","Mar", "Apr", "Dec"))
# Attempt 3
y2 <- factor(x2, levels = c(1, 2, 3, 4),
labels = c("Jan", "Mar", "Apr", "Dec"))
None of the three attempts from the previous slide works.
Attempts 1 and 2: the number of labels does not match the number of levels, so R throws an error. (In Attempt 1, levels is not specified, so R uses the 4 unique values of x2 as levels, but 6 labels are supplied.)
Attempt 3: R doesn’t throw an error, but the result is wrong. The code defines levels 1 to 4, so 12 is not among the defined levels and becomes NA. The labels are also matched to the wrong levels: 3 is labeled "Apr" and 4 is labeled "Dec".
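One way to check these three outcomes without stopping your script is to wrap the failing calls in try(). This is a sketch for verification only; the attempt1 to attempt3 names are illustrative.

```r
x2 <- c(12, 4, 1, 3)

# Attempt 1: 6 labels, but only 4 (default) levels -> error
attempt1 <- try(
  factor(x2, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")),
  silent = TRUE
)
inherits(attempt1, "try-error")   # TRUE

# Attempt 2: 12 levels, but only 4 labels -> error
attempt2 <- try(
  factor(x2, levels = 1:12, labels = c("Jan", "Mar", "Apr", "Dec")),
  silent = TRUE
)
inherits(attempt2, "try-error")   # TRUE

# Attempt 3: runs, but 12 becomes NA and the labels are mismatched
attempt3 <- factor(x2, levels = c(1, 2, 3, 4),
                   labels = c("Jan", "Mar", "Apr", "Dec"))
as.character(attempt3)            # NA "Dec" "Jan" "Apr"
```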
You can fix the code from the previous slide in two ways:
# Numeric vector representing months
x2 <- c(12, 4, 1, 3)
# Correct code option 1 (recommended)
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "June", "July", "Aug", "Sept", "Oct", "Nov", "Dec")
y2 <- factor(x2, levels = 1:12,
labels = month_labels)
# Correct code option 2 (works but keeps only used months)
y2 <- factor(x2, levels = c(1, 3, 4, 12),
labels = c("Jan", "Mar", "Apr", "Dec"))
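To see how the two options differ, you can compare their values and their levels. The sketch below uses month.abb, a constant built into base R that holds the twelve three-letter month abbreviations; it is a substitution for illustration and is not used in the original slides.

```r
x2 <- c(12, 4, 1, 3)

# Option 1: all 12 months are levels, even the ones not in the data
full <- factor(x2, levels = 1:12, labels = month.abb)

# Option 2: only the months that appear in the data are levels
used_only <- factor(x2, levels = c(1, 3, 4, 12),
                    labels = c("Jan", "Mar", "Apr", "Dec"))

# Both options give the same values...
identical(as.character(full), as.character(used_only))   # TRUE

# ...but a different number of levels
nlevels(full)        # 12
nlevels(used_only)   # 4
```

Keeping all 12 levels means that tables and plots can still show months with zero observations.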
The function factor() is the “base R” way to create and manage factors. It’s a foundational tool, and it’s important to learn how it works!
But once you’re comfortable with it, the forcats package (part of the tidyverse) offers cleaner, more powerful tools for working with categorical data.
The forcats package has several functions to work with factors. Below are three commonly used ones:
Function | What It Does | When to Use |
---|---|---|
fct_relevel() | Manually set the order of levels | Similar to factor(..., levels = ...) in base R |
fct_reorder() | Reorder levels based on another variable (e.g., numeric) | Easier to use, great for ordering bars in ggplot2 |
fct_infreq() | Reorder levels by frequency (most to least common) | Useful when showing most common categories first |
For more functions, see the forcats documentation
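A quick sketch of the three functions in the table, using a made-up factor grp and a made-up numeric vector val (both hypothetical). Note that fct_reorder() orders levels by the median of the second variable within each level unless you pass a different summary function.

```r
library(forcats)

# Hypothetical toy data
grp <- factor(c("a", "a", "b", "b", "b", "c"))
val <- c(5, 7, 10, 12, 11, 1)

levels(grp)                               # "a" "b" "c" (alphabetical default)

# Manually set the order of levels
levels(fct_relevel(grp, "b", "c", "a"))   # "b" "c" "a"

# Order levels by the median of val within each level (a = 6, b = 11, c = 1)
levels(fct_reorder(grp, val))             # "c" "a" "b"

# Order levels by how often each one appears (b = 3, a = 2, c = 1)
levels(fct_infreq(grp))                   # "b" "a" "c"
```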
In this exercise, you’ll learn two things:
to correctly use stat = "identity" with bar plots
to control the order of categories in bar plots, using both factor() and the forcats function fct_relevel()
Copy and run the code below to create this dataset:
library(tidyverse)
df <- tibble(
week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
tip = c(10, 12, 20, 8, 25, 25, 30)
)
df
Our Goal: Create a bar plot with days of the week on the x-axis and the total tip amount on the y-axis (e.g., Saturday should display a bar with a height of $55, etc.).
Try this code. What does the height of each bar represent?
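The code chunk for this slide is not reproduced here. A minimal sketch of what it might look like, assuming a geom_bar() call with only the x aesthetic mapped (an assumption, since the original chunk is missing). The table() call at the end is added for comparison with the bar heights.

```r
library(ggplot2)

# Same data as the tibble defined earlier, rebuilt as a plain data frame
df <- data.frame(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip  = c(10, 12, 20, 8, 25, 25, 30)
)

# Only x is mapped, so geom_bar() uses its default stat
p_default <- ggplot(df, aes(x = week)) +
  geom_bar()
p_default

# For comparison: number of rows per weekday, computed without ggplot2
rows_per_day <- table(df$week)
rows_per_day
```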
Warning
Why aren’t the bars showing the actual tip amounts?
Because by default, geom_bar() uses stat = "count" to count rows and plot the counts on the y-axis.
To plot the actual values (not counts), use stat = "identity" with both an x and a y aesthetic.
To fix this plot, we need to change the default stat in geom_bar() (from count to identity) and manually specify both the x and y aesthetics. How do we know this? From the documentation or by typing ?geom_bar in the R Console.
ggplot(df, aes(x = week, y = tip)) +
geom_bar(stat = "identity") +
labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")
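With stat = "identity", rows that share the same weekday are stacked, so each bar's height is the sum of that day's tips. An alternative sketch that makes the sum explicit: compute the totals first, then plot them with geom_col(), which plots y values as given. The totals and total_tip names are illustrative.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip  = c(10, 12, 20, 8, 25, 25, 30)
)

# Total tip per weekday, computed explicitly
totals <- df %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip))
totals

# geom_col() uses the y values as bar heights
ggplot(totals, aes(x = week, y = total_tip)) +
  geom_col() +
  labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")
```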
This is much better, but the bars are still not in weekday order.
We use base R’s factor() to control the order of weekdays.
Fill in the correct weekday order in the code below:
Now we make the same plot using fct_relevel().
Fill in the correct weekday order in the code below:
# With factor
days <- c("Mon", "Wed", "Thu", "Fri", "Sat")
df %>%
mutate(week = factor(week, levels = days)) %>%
ggplot(aes(x = week, y = tip)) +
geom_bar(stat = "identity") +
labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")
# With fct_relevel
days <- c("Mon", "Wed", "Thu", "Fri", "Sat")
df %>%
mutate(week = fct_relevel(week, days)) %>%
ggplot(aes(x = week, y = tip)) +
geom_bar(stat = "identity") +
labs(title = "Tips by Weekday", x = "Weekday", y = "Tip ($)")
Q: Why don’t we use labels in factor(week, levels = days)? Because the values ("Mon", etc.) are already readable. You don’t need to change them, unless you want different names (e.g., "Monday", etc.).
Q: Why doesn’t days include all 7 days of the week? We could include all possible levels, and it’s good practice for consistency when using factor(). But other functions, like fct_relevel(), only reorder levels that already exist: they do not add levels for values that aren’t present in the input data.
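The difference can be checked directly. The sketch below, with illustrative variable names, passes all seven days to both functions. factor() keeps the unused days as empty levels, while fct_relevel() warns about the unknown levels and ignores them (the warning is suppressed here to keep the output short).

```r
library(forcats)

week <- c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat")
all_days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# factor(): all 7 days become levels, including unused "Tue" and "Sun"
with_factor <- factor(week, levels = all_days)
levels(with_factor)    # "Mon" "Tue" "Wed" "Thu" "Fri" "Sat" "Sun"

# fct_relevel(): only levels present in the data are kept
with_relevel <- suppressWarnings(
  fct_relevel(factor(week), "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)
levels(with_relevel)   # "Mon" "Wed" "Thu" "Fri" "Sat"
```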
Tip
Which approach (factor() or fct_relevel()) do you prefer and why? Check the forcats package documentation for more functions, especially fct_reorder(), which is straightforward to use.
Why is stat = "identity" necessary for this kind of plot?
When do you need both levels and labels inside factor()?
Want more practice? Download today’s in-class materials for more practice exercises on working with factors!
function() | What it does |
---|---|
filter() | Selects rows based on values in one or more columns |
arrange() | Reorders rows based on the values in specified columns |
select() | Chooses specific columns by name |
rename() | Renames one or more columns |
mutate() | Adds new columns or modifies existing ones |
group_by() | Groups the data by one or more variables for grouped operations |
summarize() | Reduces each group to a single row using summary statistics (e.g., mean, sum, n) |
Tip for Remembering These Functions
Each row is an observation (e.g., one penguin) and each column is a variable (e.g., species, body mass). Some functions work on rows, like filter() and arrange(); others work on columns, like select() and mutate(). Think before coding!
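The verbs are usually chained with the pipe. Below is a minimal sketch that uses all seven in one pipeline. The birds data frame and its values are made up for illustration; they are not taken from the penguins dataset.

```r
library(dplyr)

# Hypothetical toy data (values are invented)
birds <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island      = c("Torgersen", "Biscoe", "Biscoe", "Biscoe", "Dream"),
  body_mass_g = c(3700, 3800, 5000, 5200, 3600)
)

mass_summary <- birds %>%
  filter(body_mass_g > 3650) %>%                  # rows: keep heavier birds
  mutate(body_mass_kg = body_mass_g / 1000) %>%   # columns: add a new one
  select(species, body_mass_kg) %>%               # columns: keep two
  rename(mass_kg = body_mass_kg) %>%              # columns: rename one
  group_by(species) %>%                           # group rows by species
  summarize(mean_mass_kg = mean(mass_kg),         # one row per group
            n = n()) %>%
  arrange(desc(mean_mass_kg))                     # rows: heaviest first

mass_summary
```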
function() | What it does |
---|---|
relocate() | Reorders columns by name; works on columns, not rows like arrange() |
count() | Counts observations by group |
n_distinct() | Counts the number of unique values in a column; often used with summarize() |
distinct() | Returns unique rows based on one or more columns |
across() | Applies the same operation to multiple columns at once |
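A short sketch of each function in this table, one call per function. The birds data frame and its values are made up for illustration; they are not taken from the penguins dataset.

```r
library(dplyr)

# Hypothetical toy data (values are invented)
birds <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Biscoe", "Biscoe", "Biscoe", "Dream"),
  bill_length_mm = c(39, 41, 47, 49, 50),
  body_mass_g    = c(3700, 3800, 5000, 5200, 3600)
)

# relocate(): move a column (to the front by default)
moved <- birds %>% relocate(body_mass_g)
names(moved)[1]          # "body_mass_g"

# count(): number of rows per species, stored in column n
species_counts <- birds %>% count(species)

# n_distinct(): number of unique islands
island_total <- birds %>% summarize(n_islands = n_distinct(island))

# distinct(): unique species/island combinations
pairs <- birds %>% distinct(species, island)

# across(): apply mean() to several columns at once
col_means <- birds %>%
  summarize(across(c(bill_length_mm, body_mass_g), mean))
```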
Factor levels and labels
geom_bar() plotting issues (e.g., bar heights and order)
dplyr functions for data manipulation