Introduction to R: Lecture 7

Topics: Data Cleaning

Sabrina Nardin, Summer 2025

Agenda

  1. Data Cleaning: Renaming and Recoding Variables
  2. Data Cleaning: Syntactic vs. Non-syntactic Variable Names
  3. Data Cleaning: Missing Data

These slides were last updated on August 07, 2025

1. Renaming and Recoding Variables

Definitions

Renaming: change variable names (column names)

Recoding: change values/levels of categorical variables (column values; e.g., inside a column)

Uses

What are some common scenarios where you’d want to rename variable names or recode variable values?

  • You are cleaning up imported data
    • The variable name has issues (e.g., Flipper Length (mm) → flipper_length_mm)
    • You need to standardize categories (e.g., "Good" and "GOOD" → "good")
    • Etc.
  • You are preparing data for modeling or visualization
    • You want to recode "FEMALE"/"MALE" to 0/1 for a regression model
    • Etc.

We work with the Penguins (raw) Data!

# Load libraries and data
library(tidyverse)
library(palmerpenguins)
data(penguins)

# Explore data
head(penguins_raw)
tail(penguins_raw)
rbind(head(penguins_raw, 3), tail(penguins_raw, 3))
glimpse(penguins_raw)

Renaming Variables with rename

To change variable names (column names), the most common method is rename().

Change the name of the variable studyName to study_name:

# check before renaming
str(penguins_raw)
penguins_raw %>% select(studyName)

# rename
penguins_raw %>% rename(study_name = studyName)   # new = old

# remember to save to keep changes
p <- penguins_raw %>% rename(study_name = studyName)
p %>% select(study_name)
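rename() also accepts several new = old pairs in one call. The sketch below uses a small made-up tibble (the object toy and its values are invented for illustration and are not part of penguins_raw):

```r
library(dplyr)

# Hypothetical toy data with inconsistently styled column names
toy <- tibble(studyName = c("PAL0708", "PAL0809"),
              Island    = c("Torgersen", "Biscoe"))

# Rename two columns in a single call: new = old
toy_clean <- toy %>%
  rename(study_name = studyName,
         island     = Island)

# Check the new column names
names(toy_clean)
```

Only the names change; the values inside each column stay exactly as they were.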

💻 Practice Renaming Variables

  • Use select() to check the variable Comments in penguins_raw
  • Use rename() to rename Comments to notes
  • Save the result to a new object
  • Use select() to check your result

Once done, copy your code here to share it.

Recoding Variables Method 1: with mutate + recode

To change variable values (usually levels of categorical variables), we learn two methods.

Change the levels of the categorical variable Sex (MALE becomes 1, FEMALE becomes 0) with method 1:

# check before recoding
penguins_raw %>% count(Sex)

# mutate + recode
p <- penguins_raw %>%
  mutate(Sex = recode(Sex, "MALE" = 1, "FEMALE" = 0))   # overwrite Sex so the comparison below shows the change

# compare
penguins_raw %>% count(Sex)
p %>% count(Sex)
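It can help to see what recode() does with missing values before running it on the full dataset. A minimal sketch on a made-up vector (sex_raw is a hypothetical name, not a column from penguins_raw):

```r
library(dplyr)

# Hypothetical toy vector; NA stands for an observation with no recorded sex
sex_raw <- c("MALE", "FEMALE", NA, "FEMALE")

# recode() maps each old value to its new value; NA is left as NA
sex_num <- recode(sex_raw, "MALE" = 1, "FEMALE" = 0)

sex_num
```

The result is a numeric vector in which the missing entry is still missing, so recoding does not hide NAs.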

Recoding Variables Method 2: with mutate + case_when

Change the levels of the categorical variable Sex (MALE becomes 1, FEMALE becomes 0) with method 2:

# mutate + case_when
penguins_raw %>%
  mutate(Sex = case_when(Sex == "MALE" ~ 1,
                         Sex == "FEMALE" ~ 0,
                         TRUE ~ NA_real_))
  
# like for method 1 (previous code) save results to keep changes and compare

Note

With case_when() each logical condition ~ value pair acts like if → then:

  • For each row, R checks whether the condition is TRUE: “If you find the value MALE in Sex, then convert it to 1”
  • TRUE ~ NA_real_ tells R: “If no previous condition was met, then return NA as a number”
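The if → then logic can be traced row by row on a tiny example. This is a sketch on made-up data (the tibble toy and the value "unknown" are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: two known values, one NA, one unexpected value
toy <- tibble(Sex = c("MALE", "FEMALE", NA, "unknown"))

toy_recoded <- toy %>%
  mutate(sex_num = case_when(Sex == "MALE"   ~ 1,         # row 1 matches here
                             Sex == "FEMALE" ~ 0,         # row 2 matches here
                             TRUE            ~ NA_real_)) # rows 3 and 4 fall through

toy_recoded$sex_num
```

Both the NA row and the "unknown" row match none of the earlier conditions, so they receive the fallback value.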

💻 Practice Recoding Variables

  • Use count() to check the variable Species in penguins_raw
  • Pick method 1 or method 2 to recode the values of that variable into Adelie, Chinstrap, Gentoo
  • Save the result to a new object
  • Use count() to verify the result

Once done, copy your code here to share it.

The Role of mutate in Recoding

rename()

  • Specific use: change column names
  • It changes the column’s name, but leaves the column’s values unchanged

mutate()

  • Many uses: create new columns or modify the values of existing columns
  • It changes the column’s values; if you assign the result to a new name, it also creates a new column

Note

For recoding, we use mutate() because our first goal is changing the column’s values. We learned two methods:

  • Method 1: mutate(Sex = recode(Sex, "MALE" = 1, "FEMALE" = 0))
  • Method 2: mutate(Sex = case_when(Sex == "MALE" ~ 1, Sex == "FEMALE" ~ 0))

Rename vs Recode: Syntax Reference

Function    | What It Changes | Syntax                             | Example                      | Tips
rename()    | Column names    | rename(new_name = old_name)        | rename(notes = Comments)     | No quotes around variable names
recode()    | Column values   | recode(variable, "old" = new)      | recode(Sex, "MALE" = 1)      | Check function doc to see when quotes are needed
case_when() | Column values   | case_when(variable == "old" ~ new) | case_when(Sex == "MALE" ~ 1) | Check function doc to see when quotes are needed

Note

All recoding is typically done inside mutate().

2. Syntactic vs. Non-syntactic Variable Names

Syntactic (Valid) Variable Names in R

Valid Names in R:

  • Use letters, numbers, and the symbols . or _
  • But cannot start with a number or an underscore (a leading . is allowed only when it is not followed by a number)

Examples of Valid Names:

flipper_length_mm
flipper.length.mm
flipper.length_mm     # valid but poor style
FlipperLengthMm       # valid but poor style

Non-syntactic (Invalid) Variable Names in R

What Makes a Name Invalid:

  • Contains spaces or symbols other than . and _
  • Starts with a number, an underscore, or another symbol (a leading . followed by a number is also invalid)
  • Uses reserved words (e.g., TRUE, NULL, if, function)
  • Type ?Reserved in the Console for the full list

Examples of Invalid Names:

Flipper Length (mm)
@_flipper_length_mm
flipper_ length_mm
flipper-length-mm
2flipper_length_mm
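One way to check a name programmatically is base R's make.names(), which rewrites any non-syntactic name into a syntactic one. A minimal sketch, assuming the convention that a name is syntactic when make.names() leaves it unchanged (the vector candidates is made up for illustration):

```r
# Hypothetical candidate names to test
candidates <- c("flipper_length_mm", "Flipper Length (mm)",
                "flipper-length-mm", "if")

# A name is syntactic only if make.names() returns it unchanged
is_syntactic <- make.names(candidates) == candidates

data.frame(name = candidates, syntactic = is_syntactic)
```

Only the first name passes: the others contain spaces, parentheses, or hyphens, or are reserved words.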

💻 Practice: Syntactic and Non-Syntactic Names

Which of the following are valid names?

  • 3_religion
  • #3_religion
  • q3_religion
  • q3.religion
  • q3-religion
  • q3 religion
  • TRUE

Tip

For best coding style, use snake_case for all your variable names and keep them to three words maximum. Example: q3_religion

How to Handle Non-syntactic Names — and Why It Matters

You should avoid creating non-syntactic names, BUT you’ll often encounter them, especially in datasets not created in R (from Excel or other external sources). If you don’t handle them properly, R will throw errors when you try to use them.

What to Do:

1. Use backticks to refer to them (e.g., `Flipper Length (mm)`)
2. Use rename() to change them to syntactic names

Non-syntactic names will break your code if you forget to wrap them in backticks, so renaming them once avoids repeated issues.

Working with Non-syntactic Names in Practice

Imagine you are working with data on political ideology by country, assembled by someone else. The data are in Excel, and when you import them into R they look like this:

df <- tibble(country = c("Italy", "Germany", "France", "Italy", "United States"),
             `4 ideology` = c("communism", "fascism", "anarchism", "fascism", "capitalism"))


To use the non-syntactic variable name without changing it and without errors, you must use backticks:

select(df, `4 ideology`)
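The second option, renaming, needs the backticks only once. A minimal sketch that recreates the same toy data so it runs on its own (the new name ideology and the object df_clean are choices made here for illustration):

```r
library(dplyr)

# Same toy data as above, recreated so this block is self-contained
df <- tibble(country = c("Italy", "Germany", "France", "Italy", "United States"),
             `4 ideology` = c("communism", "fascism", "anarchism", "fascism", "capitalism"))

# Backticks are needed on the old, non-syntactic name only
df_clean <- df %>% rename(ideology = `4 ideology`)

# The new name can be used without backticks from now on
df_clean %>% count(ideology)
```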


💻 Practice: Syntactic & Non-syntactic Variable Names

Try this in R:

  • Use glimpse(penguins_raw) or str(penguins_raw) and identify non-syntactic variable names in this raw dataset
  • Pick one of them, and try accessing it with select() without backticks: what happens?
  • Use rename() to give the variable a syntactically valid name
  • Save the result to a new object
  • Verify the name was changed and you can now access it

Once done, copy your code here to share it.

3. Missing Data

What Are Missing Data?

R distinguishes two types of missing data:

  • Explicit missing: visible NA or NaN values in the dataset
  • Implicit missing: data that was never recorded

In this course, we focus on explicit missing data. For implicit missing data, see R for Data Science, Chapter 18.

Note

Explicit = value is missing as NA (Not Available) or NaN (Not a Number)
Implicit = value was never recorded (row or cell is absent)
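The difference is easier to see on a small example. The sketch below goes slightly beyond the tools covered in these slides: it uses complete() from tidyr to turn implicit missing values into explicit ones, on made-up data (the tibble counts and its columns are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical yearly counts:
# - site B in 2008 is explicitly missing (the row exists, the value is NA)
# - site B in 2009 is implicitly missing (the row does not exist at all)
counts <- tibble(site = c("A", "A", "B"),
                 year = c(2008, 2009, 2008),
                 n    = c(10, 12, NA))

# complete() adds the absent site/year combinations as rows filled with NA
counts_full <- counts %>% complete(site, year)

counts_full
```

After complete(), both missing values are visible as NA, so the tools for explicit missing data apply to them.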

How Missing Data Behave

Most operations involving a missing value also return a missing value (see Section 12.2.2, “Missing values”, of R for Data Science for more):

NA > 5                               # NA

sum(c(3, 1, 4, NA))                  # NA
sum(c(3, 1, 4, NA), na.rm = TRUE)    # 8

mean(c(3, 1, 4, NA))                 # NA
mean(c(3, 1, 4, NA), na.rm = TRUE)   # 2.666667

Common Ways to Handle Missing Data

We’ll review three main tools:

  1. is.na() – to detect missing values
  2. na.rm = TRUE – to ignore missing values
  3. drop_na() – to remove missing values

1. Detect Missing Data with is.na()

Use is.na() to find the missing values in a specific variable. It returns TRUE for missing values, and FALSE otherwise.

Check for missing values in the penguins_raw dataset:

# using base R syntax
sum(is.na(penguins_raw$Sex))
table(is.na(penguins_raw$Sex))

# using tidyverse syntax
penguins_raw %>% summarize(n_missing = sum(is.na(Sex)))
penguins_raw %>% count(is.na(Sex))

# filter rows where Sex is missing
filter(penguins_raw, is.na(Sex))    # correct
filter(penguins_raw, Sex == NA)     # incorrect: Sex == NA is always NA, so no rows are returned
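Why the == NA version fails can be checked on a few rows. A minimal sketch on made-up data (the tibble toy and the n_* objects are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data with two missing values in Sex
toy <- tibble(id = 1:4, Sex = c("MALE", NA, "FEMALE", NA))

# Comparing anything to NA gives NA, and filter() keeps only rows that are TRUE
n_wrong   <- nrow(filter(toy, Sex == NA))     # 0 rows
n_missing <- nrow(filter(toy, is.na(Sex)))    # 2 rows
n_present <- nrow(filter(toy, !is.na(Sex)))   # 2 rows

c(wrong = n_wrong, missing = n_missing, present = n_present)
```

Negating is.na() with ! is the usual way to keep only the rows where a value is present.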

2. Ignore Missing Data with na.rm = TRUE

Use na.rm = TRUE to exclude missing values when performing calculations. It is often used inside summarize() when calculating statistics such as the mean, sum, or standard deviation.

penguins_raw %>% summarize(avg_mass = mean(`Body Mass (g)`, na.rm = TRUE))
penguins_raw %>% summarize(sum_mass = sum(`Body Mass (g)`, na.rm = TRUE))


Tip

The argument na.rm = TRUE does not remove missing data from the variable(s). It only skips them for that one operation; the missing values remain in the data.
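This can be verified directly on a short vector. A minimal sketch with made-up body mass values (mass is a hypothetical object, not a column from penguins_raw):

```r
# Hypothetical body mass values in grams, one of them missing
mass <- c(3700, 3800, NA, 3900)

mean(mass)                 # NA: the missing value propagates
mean(mass, na.rm = TRUE)   # 3800: the NA is skipped for this calculation only

# The vector itself is unchanged: it still holds 4 values, 1 of them NA
length(mass)
sum(is.na(mass))
```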


3. Remove Missing Data with drop_na()

Use drop_na() to remove rows with missing values, either across all columns or in one or more specific columns.

Drop missing values in one specific column (preferred):

penguins_raw %>%
  drop_na(`Body Mass (g)`) %>%
  summarize(avg_mass = mean(`Body Mass (g)`))


Warning

Be careful with drop_na() as it removes entire rows, which may unintentionally filter out relevant data. Check which variable(s) you are dropping, and avoid using it blindly across all columns.
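The difference between the two uses shows up clearly in the row counts. A minimal sketch on made-up data (the tibble toy and its columns are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: mass is missing once, comments is missing three times
toy <- tibble(id       = 1:4,
              mass     = c(3700, NA, 3900, 3800),
              comments = c(NA, NA, "nest abandoned", NA))

# Dropping on one column removes only the row where mass is NA
n_specific <- nrow(drop_na(toy, mass))   # 3 rows remain

# Dropping across all columns removes every row with any NA
n_all <- nrow(drop_na(toy))              # 1 row remains

c(specific = n_specific, all = n_all)
```

Here a mostly empty comments column would cost three of the four rows, even though their mass values are usable.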


💻 Practice: Handling Missing Data

  1. Rename: use the penguins_raw dataset and rename Flipper Length (mm) to flipper_length_mm. Save the result as a new dataframe, e.g., penguins_clean or p

Use the new dataframe with the renamed variable for the tasks below:

  1. Detect missing values: use is.na() and sum() to count how many are missing in the variable flipper length

  2. Exclude missing from calculations: use na.rm = TRUE inside mean() to calculate the average flipper length

  3. Drop missing values: use drop_na() to remove rows with missing values in flipper length

Once done, share your code here.

Ways to Fill or Replace Missing Data

Main functions to replace or fill missing values:

  • replace_na() – replace missing values with a specified value (from the package tidyr)
  • fill() – carry values forward or backward (from the package tidyr)
  • coalesce() – return the first non-missing value across multiple columns (from the package dplyr)

See Chapter 18 of R for Data Science for more.
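The three functions can be combined in one pipeline. This is a sketch on made-up data, not a recommendation for how to treat any particular dataset (the tibble toy, its columns, and the replacement value "none" are all invented for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with gaps in every column except mass_b's last row
toy <- tibble(island   = c("Biscoe", NA, NA, "Dream"),
              mass_a   = c(3700, NA, 3900, NA),
              mass_b   = c(NA, 3650, 4000, 3500),
              comments = c("ok", NA, NA, "ok"))

toy_filled <- toy %>%
  fill(island) %>%                                   # carry the last seen island downward
  mutate(mass     = coalesce(mass_a, mass_b),        # first non-missing of the two columns
         comments = replace_na(comments, "none"))    # replace NA with a fixed value

toy_filled
```

Whether filling or replacing is appropriate depends on why the values are missing, so check the result before using it in an analysis.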

Recap: What We Learned Today

  • How to rename variables using rename()
  • How to recode values inside a variable using recode() and case_when()
  • The difference between syntactic and non-syntactic variable names, and how to handle them
  • How to detect, ignore, or drop missing values using:
    • is.na() to detect
    • na.rm = TRUE to ignore
    • drop_na() to remove
