Topics: Data Cleaning
These slides were last updated on August 07, 2025
Renaming: change variable names (column names)
Recoding: change values/levels of categorical variables (the values inside a column)
What are some common scenarios where you’d want to rename variable names or recode variable values?
Flipper Length (mm) → flipper_length_mm
"Good" and "GOOD" → "good"
"FEMALE"/"MALE" to 0/1 for a regression model
To change variable names (column names), the most common method is rename().
Change the name of the variable studyName to study_name:
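A minimal sketch of this rename, assuming dplyr is installed. The two-row tibble `toy` is a hypothetical stand-in for penguins_raw so the example runs on its own:

```r
library(dplyr)

# Hypothetical stand-in for penguins_raw (only two of its columns)
toy <- tibble(studyName = c("PAL0708", "PAL0809"),
              Comments  = c(NA, "Nest never observed with full clutch."))

# rename(new_name = old_name): the new name goes on the left
toy_renamed <- toy %>%
  rename(study_name = studyName)

names(toy_renamed)
```

As with any dplyr verb, the change is only kept if the result is assigned to an object (here `toy_renamed`).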
Use select() to check the variable Comments in penguins_raw
Use rename() to rename Comments to notes
Use select() to check your result
Once done, copy your code here to share it.
To change variable values (usually levels of categorical variables), we learn two methods.
Change the levels of the categorical variable Sex (MALE becomes 1, FEMALE becomes 0) with method 1:
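Method 1 pairs mutate() with recode(), as listed in the summary later in these slides. A minimal sketch, using a hypothetical four-row tibble `toy` in place of penguins_raw:

```r
library(dplyr)

# Hypothetical stand-in for the Sex column of penguins_raw
toy <- tibble(Sex = c("MALE", "FEMALE", NA, "MALE"))

# mutate + recode: recode(column, "old" = new)
toy_recoded <- toy %>%
  mutate(Sex = recode(Sex, "MALE" = 1, "FEMALE" = 0))

# Missing values in Sex stay missing after recoding
toy_recoded$Sex
```

Recent dplyr releases mark recode() as superseded; it still works, but it is worth checking the function documentation for the currently recommended alternative.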
Change the levels of the categorical variable Sex (MALE becomes 1, FEMALE becomes 0) with method 2:
# mutate + case_when
penguins_raw %>%
  mutate(Sex = case_when(Sex == "MALE" ~ 1,
                         Sex == "FEMALE" ~ 0,
                         TRUE ~ NA_real_))
# as with method 1 (previous code), save the result to keep the changes and compare
Note
With case_when(), each logical condition ~ value pair acts like if → then: TRUE ~ NA_real_ tells R: “If no previous condition was met, then return NA as a number”
Use count() to check the variable Species in penguins_raw
Use count() to verify both results
Once done, copy your code here to share it.
rename()
mutate()
Note
For recoding, we use mutate() because our first goal is changing the column’s values. We learned two methods:
mutate(Sex = recode(Sex, "MALE" = 1, "FEMALE" = 0))
mutate(Sex = case_when(Sex == "MALE" ~ 1, Sex == "FEMALE" ~ 0))
Function | What It Changes | Syntax | Example | Tips |
---|---|---|---|---|
rename() | Column names | rename(new_name = old_name) | rename(notes = Comments) | No quotes around variable names |
recode() | Column values | recode(variable, "old" = new) | recode(Sex, "MALE" = 1) | Check function doc to see when quotes are needed |
case_when() | Column values | case_when(variable == "old" ~ new) | case_when(Sex == "MALE" ~ 1) | Check function doc to see when quotes are needed |
Note
All recoding is typically done inside mutate().
Syntactic names use only letters, digits, . or _
flipper_length_mm
flipper.length.mm
flipper.length_mm # valid but poor style
FlipperLengthMm # valid but poor style
Reserved words cannot be used as names (TRUE, NULL, if, function); run ?Reserved in the Console for the full list
Flipper Length (mm)
@_flipper_length_mm
flipper_ length_mm
flipper-length-mm
.flipper.length.mm
Which of the following are valid names?
3_religion
#3_religion
q3_religion
q3.religion
q3-religion
q3 religion
TRUE
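One way to check your answers is base R's make.names(), which rewrites any name that is not syntactic. This is only a sketch of a checking trick, not part of the course workflow: a name that comes back unchanged is already syntactic.

```r
# Candidate names from the quiz above
candidates <- c("3_religion", "#3_religion", "q3_religion",
                "q3.religion", "q3-religion", "q3 religion", "TRUE")

# make.names() alters non-syntactic names (and reserved words),
# so comparing before and after flags the valid ones
is_syntactic <- make.names(candidates) == candidates

data.frame(name = candidates, syntactic = is_syntactic)
```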
Tip
For best coding style, use snake_case for all your variable names and keep them to three words maximum. Example: q3_religion
You should avoid creating non-syntactic names, BUT you’ll often encounter them, especially in datasets not created in R (from Excel or other external sources). If you don’t handle them properly, R will throw errors when you try to use them.
1. Use backticks to refer to them (e.g., `Flipper Length (mm)`)
2. Use rename() to change them to syntactic names
Non-syntactic names will break code if you forget to wrap them in backticks, so renaming avoids issues.
Imagine you are working with data on political ideology by country, assembled by someone else. The data are in Excel, and when you import them into R they look like this:
df <- tibble(country = c("Italy", "Germany", "France", "Italy", "United States"),
             `4 ideology` = c("communism", "fascism", "anarchism", "fascism", "capitalism"))
To use the non-syntactic variable name without changing it and without errors, you must use backticks:
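A minimal sketch of both options with the `df` example above, assuming dplyr is installed. The object name `df_clean` and the new column name `ideology` are illustrative choices, not names from the original slides:

```r
library(dplyr)

df <- tibble(country = c("Italy", "Germany", "France", "Italy", "United States"),
             `4 ideology` = c("communism", "fascism", "anarchism", "fascism", "capitalism"))

# Option 1: backticks let R read the non-syntactic name as one column name
df %>% select(`4 ideology`)
ideology_counts <- df %>% count(`4 ideology`)

# Option 2: rename once, then no backticks are needed afterwards
df_clean <- df %>% rename(ideology = `4 ideology`)
df_clean %>% count(ideology)
```

Without the backticks, `select(4 ideology)` is not valid R syntax, so the code stops with an error before it runs.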
Try this in R:
Use glimpse(penguins_raw) or str(penguins_raw) and identify non-syntactic variable names in this raw dataset
Use select() without backticks: what happens?
Use rename() to give the variable a valid syntactic name
Once done, copy your code here to share it.
R distinguishes two types of missing data:
Explicit: NA or NaN values in the dataset
In this course, we focus on explicit missing data. For implicit missing data, see R for Data Science Chapter 18.
Note
Explicit = value is missing as NA (Not Available) or NaN (Not a Number)
Implicit = value was never recorded (row or cell is absent)
Almost any operation involving a missing value will also return a missing value (see Chapter 12.2.2 Missing values for more):
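A short base R illustration of this behavior; the vector `x` is a made-up example:

```r
# A small vector with one explicit missing value
x <- c(10, NA, 30)

x + 1      # arithmetic: the missing element stays NA
x > 15     # comparison: the missing element stays NA
mean(x)    # summary: one NA makes the whole result NA
NA == NA   # even comparing NA with NA returns NA

# is.na() is the reliable way to test for missing values
is.na(x)
```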
We’ll review three main tools:
is.na() – to detect missing values
na.rm = TRUE – to ignore missing values
drop_na() – to remove missing values
Use is.na() to find the missing values in a specific variable. It returns TRUE for missing values, and FALSE otherwise.
Check for missing values in the penguins_raw dataset:
# using base R syntax
sum(is.na(penguins_raw$Sex))
table(is.na(penguins_raw$Sex))
# using tidyverse syntax
penguins_raw %>% summarize(sum(is.na(Sex)))
penguins_raw %>% count(is.na(Sex))
# filter rows where Sex is missing
filter(penguins_raw, is.na(Sex)) # correct
filter(penguins_raw, Sex == NA) # incorrect: Sex == NA is always NA, so no rows are kept
Use na.rm = TRUE to exclude missing values when performing calculations. It is often used with summarize() when calculating things like the mean, sum, or standard deviation.
penguins_raw %>% summarize(avg_mass = mean(`Body Mass (g)`, na.rm = TRUE))
penguins_raw %>% summarize(sum_mass = sum(`Body Mass (g)`, na.rm = TRUE))
Tip
The argument na.rm = TRUE does not remove missing data from the variable(s); it only skips them for that one calculation, and the rows stay in the data.
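To see this in practice, here is a small sketch with a hypothetical three-row tibble `toy` (the column name `body_mass_g` is made up for the example):

```r
library(dplyr)

# Hypothetical data with one missing body mass
toy <- tibble(body_mass_g = c(3750, NA, 3250))

# The mean is computed from the two observed values only
avg <- toy %>% summarize(avg_mass = mean(body_mass_g, na.rm = TRUE))
avg

# The data itself is unchanged: still three rows, one of them missing
nrow(toy)
sum(is.na(toy$body_mass_g))
```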
Use drop_na() to remove rows with missing values, either across all columns or in a specific column.
Drop missing values in one specific column (preferred):
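A minimal sketch comparing the two uses, assuming dplyr and tidyr are installed. The tibble `toy` is a hypothetical stand-in for penguins_raw:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: Sex is missing once, Comments is missing twice
toy <- tibble(Sex      = c("MALE", NA, "FEMALE"),
              Comments = c(NA, NA, "Adult not sampled."))

# One specific column: only rows where Sex is missing are removed
by_column <- toy %>% drop_na(Sex)

# All columns: any row with at least one NA is removed
all_columns <- toy %>% drop_na()

nrow(by_column)
nrow(all_columns)
```

In this toy data the column-specific call keeps two rows, while the unrestricted call keeps only one, because a missing comment is enough to drop a row.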
Warning
Be careful with drop_na() as it removes entire rows, which may unintentionally filter out relevant data. Check which variable(s) you are dropping, and avoid using it blindly across all columns.
Use the penguins_raw dataset and rename Flipper Length (mm) to flipper_length_mm. Save the result as a new dataframe, e.g., penguins_clean or p
Use the new dataframe with the renamed variable for the tasks below:
Detect missing values: use is.na() and sum() to count how many values are missing in the flipper length variable
Exclude missing values from calculations: use na.rm = TRUE inside mean() to calculate the average flipper length
Drop missing values: use drop_na() to remove rows with missing values in flipper length
Once done, share your code here.
Main functions to replace or fill missing values:
replace_na() – replace missing values with a specified value
fill() – carry values forward or backward (from the package tidyr)
coalesce() – return the first non-missing value across multiple columns
See Chapter 18 of R for Data Science for more.
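A brief sketch of the three functions side by side, assuming dplyr and tidyr are installed. The tibble `toy` and its columns (`island`, `notes`, `backup_notes`) are hypothetical and exist only for this example:

```r
library(dplyr)
library(tidyr)

toy <- tibble(island       = c("Torgersen", NA, NA, "Dream"),
              notes        = c(NA, "checked", NA, NA),
              backup_notes = c("first visit", "ignored", NA, "late season"))

# replace_na(): swap every NA in a column for one fixed value
replaced <- toy %>% mutate(notes = replace_na(notes, "none"))

# fill(): carry the last observed value down into the gaps
filled <- toy %>% fill(island, .direction = "down")

# coalesce(): take notes if present, otherwise fall back to backup_notes
combined <- toy %>% mutate(notes = coalesce(notes, backup_notes))

replaced$notes
filled$island
combined$notes
```

Note that coalesce() leaves a value missing when every column it looks at is missing in that row.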
rename()
recode() and case_when()
is.na() to detect
na.rm = TRUE to ignore
drop_na() to remove