Introduction to R: Lecture 7

Topics: Data Cleaning

Sabrina Nardin, Summer 2025

Agenda

Data Cleaning: Renaming and Recoding Variables
Data Cleaning: Syntactic vs. Non-syntactic Variable Names
Data Cleaning: Missing Data

These slides were last updated on August 07, 2025

1. Renaming and Recoding Variables

Definitions

Renaming: change variable names (column names)

Recoding: change values/levels of categorical variables (i.e., the values inside a column)

Uses

What are some common scenarios where you’d want to rename variables or recode their values?

You are cleaning up imported data
- The variable name has issues (e.g., Flipper Length (mm) → flipper_length_mm)
- You need to standardize categories (e.g., "Good" and "GOOD" → "good")
- Etc.
You are preparing data for modeling or visualization
- You want to recode "FEMALE"/"MALE" to 0/1 for a regression model
- Etc.
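
The two cleanup scenarios above can be sketched on a small made-up tibble. The object `survey`, its columns, and its values are hypothetical and are not part of the penguins data:

```r
library(dplyr)

# Hypothetical toy data with a messy column name and inconsistent categories
survey <- tibble(
  `Flipper Length (mm)` = c(181, 186, 195),
  rating = c("Good", "GOOD", "good")
)

survey_clean <- survey %>%
  rename(flipper_length_mm = `Flipper Length (mm)`) %>%  # fix the column name
  mutate(rating = tolower(rating))                       # standardize the categories

survey_clean %>% count(rating)
```

Lowercasing with tolower() is only one way to standardize categories; it works here because the values differ only in capitalization.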

We work with the Penguins (raw) Data!

# Load libraries and data
library(tidyverse)
library(palmerpenguins)
data(penguins)

# Explore data
head(penguins_raw)
tail(penguins_raw)
rbind(head(penguins_raw, 3), tail(penguins_raw, 3))
glimpse(penguins_raw)

Renaming Variables with rename

To change variable names (column names), the most common method is rename().

Change the name of the variable studyName to study_name:

# check before renaming
str(penguins_raw)
penguins_raw %>% select(studyName)

# rename
penguins_raw %>% rename(study_name = studyName)   # new = old

# remember to save to keep changes
p <- penguins_raw %>% rename(study_name = studyName)
p %>% select(study_name)
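
rename() also accepts several new = old pairs in a single call. A minimal sketch, assuming a hypothetical toy tibble `raw` that mimics two columns of penguins_raw:

```r
library(dplyr)

# Hypothetical toy data mimicking two penguins_raw columns
raw <- tibble(studyName = c("PAL0708", "PAL0809"),
              Region    = c("Anvers", "Anvers"))

# Several new = old pairs can go in one rename() call
renamed <- raw %>% rename(study_name = studyName, region = Region)
names(renamed)
```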

💻 Practice Renaming Variables

Use select() to check the variable Comments in penguins_raw
Use rename() to rename Comments to notes
Save the result to a new object
Use select() to check your result

Once done, copy your code here to share it.

Recoding Variables Method 1: with mutate + recode

To change variable values (usually levels of categorical variables), we learn two methods.

Change the levels of the categorical variable Sex (MALE becomes 1, FEMALE becomes 0) with method 1:

# check before recoding
penguins_raw %>% count(Sex)

# mutate + recode
p <- penguins_raw %>%
  mutate(Sex = recode(Sex, "MALE" = 1, "FEMALE" = 0))

# compare
penguins_raw %>% count(Sex)
p %>% count(Sex)
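
One way to double-check a recode is to write the new values to a separate column and count both columns together. This is a sketch on hypothetical toy data; the names `toy`, `checked`, and `sex_num` are illustrative:

```r
library(dplyr)

# Hypothetical toy data standing in for the Sex column of penguins_raw
toy <- tibble(Sex = c("MALE", "FEMALE", NA, "MALE"))

# Store the recoded values in a new column so the original stays available
checked <- toy %>%
  mutate(sex_num = recode(Sex, "MALE" = 1, "FEMALE" = 0))

# Each original value should map to exactly one new value
checked %>% count(Sex, sex_num)
```

Once the mapping looks right, you can overwrite the original column or drop it with select().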

Recoding Variables Method 2: with mutate + case_when

Change the levels of the categorical variable Sex (MALE becomes 1, FEMALE becomes 0) with method 2:

# mutate + case_when
penguins_raw %>%
  mutate(Sex = case_when(Sex == "MALE" ~ 1,
                         Sex == "FEMALE" ~ 0,
                         TRUE ~ NA_real_))
  
# like for method 1 (previous code) save results to keep changes and compare

Note

With case_when() each logical condition ~ value pair acts like if → then:

For each row, R checks whether the condition is TRUE: “If you find the value MALE in Sex, then convert it to 1”
TRUE ~ NA_real_ tells R: “If no previous condition was met, then return NA as a number”
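
Because conditions are checked in order, the first one that is TRUE decides the value. A minimal sketch with made-up body masses; the tibble `toy`, the cutoffs, and the labels are assumptions for illustration only:

```r
library(dplyr)

# Hypothetical body masses in grams (not taken from penguins_raw)
toy <- tibble(mass_g = c(3200, 4100, 5600, NA))

toy_sized <- toy %>%
  mutate(size = case_when(
    mass_g < 3500  ~ "small",    # checked first
    mass_g < 5000  ~ "medium",   # only reached if the first condition was not TRUE
    mass_g >= 5000 ~ "large",
    TRUE ~ NA_character_         # fallback: missing or unmatched values
  ))

toy_sized
```

The fallback uses NA_character_ instead of NA_real_ because the new column holds text.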

💻 Practice Recoding Variables

Use count() to check the variable Species in penguins_raw
Pick method 1 or method 2 to recode the values of that variable into Adelie, Chinstrap, Gentoo
Save the result to a new object
Use count() to verify both results

Once done, copy your code here to share it.

The Role of mutate in Recoding

`rename()`

Specific use: change column names
It changes the column’s name, but leaves the column’s values unchanged

`mutate()`

Many uses: create new columns or modify the values of existing columns
It changes the column’s values; assigning the result to a new name creates an additional column rather than renaming the old one

Note

For recoding, we use mutate() because our first goal is changing the column’s values. We learned two methods:

Method 1: mutate(Sex = recode(Sex, "MALE" = 1, "FEMALE" = 0))
Method 2: mutate(Sex = case_when(Sex == "MALE" ~ 1, Sex == "FEMALE" ~ 0))

Rename vs Recode: Syntax Reference

Function	What It Changes	Syntax + Example	Tips
`rename()`	Column names	`rename(new_name = old_name)` `rename(notes = Comments)`	No quotes around variable names
`recode()`	Column values	`recode(variable, "old" = new)` `recode(Sex, "MALE" = 1)`	Check function doc to see when quotes are needed
`case_when()`	Column values	`case_when(variable == "old" ~ new)` `case_when(Sex == "MALE" ~ 1)`	Check function doc to see when quotes are needed

Note

Recoding is typically done inside mutate().

2. Syntactic vs. Non-syntactic Variable Names

Syntactic (Valid) Variable Names in R

Valid Names in R:

Use letters, numbers, and the symbols . or _
But cannot start with a number or an underscore (a leading . is allowed only if it is not followed by a number)

Examples of Valid Names:

flipper_length_mm
flipper.length.mm
flipper.length_mm     # valid but poor style
FlipperLengthMm       # valid but poor style

Non-syntactic (Invalid) Variable Names in R

What Makes a Name Invalid:

Contains spaces or symbols other than . and _
Starts with a number, an underscore, or another symbol
Uses reserved words (e.g., TRUE, NULL, if, function)
Type ?Reserved in the Console for the full list

Examples of Invalid Names:

Flipper Length (mm)
@_flipper_length_mm
flipper_ length_mm
flipper-length-mm
_flipper.length.mm
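
If you are unsure about a name, base R can check it for you. The sketch below relies on make.names(), which rewrites a string into a syntactic name; comparing input and output is one possible check, not an official validity test:

```r
# Candidate names taken from the examples and rules above
candidates <- c("flipper_length_mm", "flipper.length.mm",
                "Flipper Length (mm)", "flipper-length-mm", "NULL")

# A name that make.names() returns unchanged was already syntactic
is_syntactic <- make.names(candidates) == candidates
data.frame(candidates, is_syntactic)
```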

💻 Practice: Syntactic and Non-Syntactic Names

Which of the following are valid names?

3_religion
#3_religion
q3_religion
q3.religion
q3-religion
q3 religion
TRUE

Tip

For best coding style, use snake_case for all your variable names and keep them to three words maximum. Example: q3_religion

How to Handle Non-syntactic Names — and Why It Matters

You should avoid creating non-syntactic names, BUT you’ll often encounter them, especially in datasets not created in R (from Excel or other external sources). If you don’t handle them properly, R will throw errors when you try to use them.

What to Do:

1. Use backticks to refer to them (e.g., `Flipper Length (mm)`)
2. Use rename() to change them to syntactic names

Non-syntactic names will break code if you forget to wrap them in backticks, so renaming avoids issues.

Working with Non-syntactic Names in Practice

Imagine you are working on data about political ideology by country, assembled by someone else. The data are in Excel, and when you import them into R they look like this:

df <- tibble(country = c("Italy", "Germany", "France", "Italy", "United States"),
             `4 ideology` = c("communism", "fascism", "anarchism", "fascism", "capitalism"))

To use the non-syntactic variable name without changing it and without errors, you must use backticks:

select(df, `4 ideology`)
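
The second option, renaming, could look like the sketch below. The new name `ideology` and the object `df_clean` are illustrative choices:

```r
library(dplyr)

df <- tibble(country = c("Italy", "Germany", "France", "Italy", "United States"),
             `4 ideology` = c("communism", "fascism", "anarchism", "fascism", "capitalism"))

# Rename once, using backticks only for the old name
df_clean <- df %>% rename(ideology = `4 ideology`)

# The new name works without backticks
df_clean %>% count(ideology)
```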

💻 Practice: Syntactic & Non-syntactic Variable Names

Try this in R:

Use glimpse(penguins_raw) or str(penguins_raw) and identify non-syntactic variable names in this raw dataset
Pick one of them, and try accessing it with select() without backticks: what happens?
Use rename() to give the variable a syntactically valid name
Save the result to a new object
Verify the name was changed and you can now access it

Once done, copy your code here to share it.

3. Missing Data

What Are Missing Data?

We distinguish two types of missing data:

Explicit missing: visible NA or NaN values in the dataset
Implicit missing: data that was never recorded

In this course, we focus on explicit missing data. For implicit missing data, see R for Data Science Chapter 18

Note

Explicit = value is missing as NA (Not Available) or NaN (Not a Number)
Implicit = value was never recorded (row or cell is absent)
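
To make the distinction concrete, here is a sketch on made-up counts. The tibble `counts` is hypothetical, and complete() from tidyr is used here only to show the implicit gap; it is not covered further in this lecture:

```r
library(dplyr)
library(tidyr)

# Hypothetical counts: island "B" has no row for 2008 (implicit missing),
# while island "A" in 2008 has a row holding NA (explicit missing)
counts <- tibble(
  island = c("A", "A", "B"),
  year   = c(2007, 2008, 2007),
  n      = c(10, NA, 7)
)

# complete() adds a row for every island-year combination, filling gaps with NA
counts_full <- counts %>% complete(island, year)
counts_full
```

After complete(), the previously absent row for island "B" in 2008 appears with an explicit NA.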

How Missing Data Behave

Almost any operation involving a missing value will also return a missing value (see R for Data Science, Section 12.2.2 “Missing values”, for more):

NA > 5

sum(c(3, 1, 4, NA))
sum(c(3, 1, 4, NA), na.rm = TRUE)

mean(c(3, 1, 4, NA))
mean(c(3, 1, 4, NA), na.rm = TRUE)
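
NaN behaves much like NA in these situations. A short base R sketch, where the vector `x` is made up for illustration:

```r
x <- c(3, NA, 0 / 0)    # 0 / 0 produces NaN

na_flags  <- is.na(x)   # TRUE for both NA and NaN
nan_flags <- is.nan(x)  # TRUE only for NaN

NA == NA                # NA: comparing two unknown values is itself unknown

avg <- mean(x, na.rm = TRUE)   # na.rm = TRUE skips both NA and NaN
avg
```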

Common Ways to Handle Missing Data

We’ll review three main tools:

is.na() – to detect missing values
na.rm = TRUE – to ignore missing values
drop_na() – to remove missing values

1. Detect Missing Data with is.na():

Use is.na() to find the missing values in a specific variable. It returns TRUE for missing values, and FALSE otherwise.

Check for missing values in the penguins_raw dataset:

# using base R syntax
sum(is.na(penguins_raw$Sex))
table(is.na(penguins_raw$Sex))

# using tidyverse syntax
penguins_raw %>% summarize(sum(is.na(Sex)))
penguins_raw %>% count(is.na(Sex))

# filter rows where Sex is missing
filter(penguins_raw, is.na(Sex))    # correct
filter(penguins_raw, Sex == NA)     # incorrect: Sex == NA is never TRUE, so no rows are returned
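
Why the comparison with == fails can be seen on a small made-up tibble (`toy` is hypothetical): the comparison returns NA for every row, and filter() keeps only rows where the condition is TRUE.

```r
library(dplyr)

toy <- tibble(id = 1:4, Sex = c("MALE", NA, "FEMALE", NA))

toy$Sex == NA   # NA for every row, never TRUE

n_wrong   <- toy %>% filter(Sex == NA) %>% nrow()     # filter() keeps only TRUE rows
n_correct <- toy %>% filter(is.na(Sex)) %>% nrow()    # rows where Sex is missing
n_kept    <- toy %>% filter(!is.na(Sex)) %>% nrow()   # rows where Sex is recorded
c(wrong = n_wrong, correct = n_correct, kept = n_kept)
```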

2. Ignore Missing Data with na.rm = TRUE

Use na.rm = TRUE to exclude missing values when performing calculations. It is often used inside summarize() when calculating summaries such as the mean, sum, or standard deviation.

penguins_raw %>% summarize(avg_mass = mean(`Body Mass (g)`, na.rm = TRUE))
penguins_raw %>% summarize(sum_mass = sum(`Body Mass (g)`, na.rm = TRUE))

Tip

The argument na.rm = TRUE does not remove missing data from the variable(s); it only skips them for that one calculation, and the data themselves stay unchanged.
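
A quick way to convince yourself, using a made-up vector (`mass` is hypothetical):

```r
mass <- c(3750, 3800, NA, 3450)   # hypothetical body masses in grams

avg <- mean(mass, na.rm = TRUE)   # NA is skipped for this calculation only

length(mass)       # the vector still has all four elements
sum(is.na(mass))   # and the missing value is still there
```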

3. Remove Missing Data with drop_na()

Use drop_na() to remove rows with missing values, either across all columns or in specific columns.

Drop missing values in one specific column (preferred):

penguins_raw %>%
  drop_na(`Body Mass (g)`) %>%
  summarize(avg_mass = mean(`Body Mass (g)`))

Warning

Be careful with drop_na() as it removes entire rows, which may unintentionally filter out relevant data. Check which variable(s) you are dropping, and avoid using it blindly across all columns.
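
The difference can be large when one column is mostly empty. A sketch on hypothetical data (the tibble `toy` and its values are made up):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: comments are mostly missing, body mass only once
toy <- tibble(
  id       = 1:4,
  mass_g   = c(3750, NA, 3250, 3450),
  comments = c(NA, NA, "Nest never observed", NA)
)

rows_all  <- toy %>% drop_na() %>% nrow()         # NA in any column drops the row
rows_mass <- toy %>% drop_na(mass_g) %>% nrow()   # only rows with missing mass are dropped
c(all_columns = rows_all, mass_only = rows_mass)
```

Here the blanket call keeps a single row, while the targeted call keeps three.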

💻 Practice: Handling Missing Data

Rename: use the penguins_raw dataset and rename Flipper Length (mm) to flipper_length_mm. Save the result as a new dataframe, e.g., penguins_clean or p

Use the new dataframe with the renamed variable for the tasks below:

Detect missing values: use is.na() and sum() to count how many are missing in the variable flipper length
Exclude missing from calculations: use na.rm = TRUE inside mean() to calculate the average flipper length
Drop missing values: use drop_na() to remove rows with missing values in flipper length

Once done, share your code here.

Ways to Fill or Replace Missing Data

Main functions to replace or fill missing values:

replace_na() – replace missing values with a specified value
fill() – carry values forward or backward (from the package tidyr)
coalesce() – return the first non-missing value across multiple columns

See Chapter 18 of R for Data Science for more.
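
A minimal sketch of the three functions on made-up data. The tibble `toy`, its column names, and the replacement value "none" are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical data with gaps in several columns
toy <- tibble(
  island   = c("Biscoe", NA, NA, "Dream"),
  notes    = c("ok", NA, "ok", NA),
  mass_est = c(NA, 3800, NA, 3450),
  mass_obs = c(3750, NA, NA, 3500)
)

filled <- toy %>%
  mutate(notes = replace_na(notes, "none")) %>%   # swap NA for a fixed value
  fill(island, .direction = "down") %>%           # carry the last island forward
  mutate(mass = coalesce(mass_obs, mass_est))     # first non-missing of the two

filled
```

Note that coalesce() still returns NA for the third row, because both mass columns are missing there.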

Recap: What We Learned Today

How to rename variables using rename()
How to recode values inside a variable using recode() and case_when()
The difference between syntactic and non-syntactic variable names, and how to handle them
How to detect, ignore, or drop missing values using:
- is.na() to detect
- na.rm = TRUE to ignore
- drop_na() to remove
