Introduction to R: Lecture 3

Topics: Intro to ggplot2, Coding Style

Sabrina Nardin, Summer 2025

Agenda

  1. Intro to the Grammar of Graphics and ggplot2
  2. R Coding Style: Best Practices
  3. More on the Grammar of Graphics and ggplot2
  4. Simplifying ggplot2 with Defaults
  5. Practice: Gapminder Data

Reminders

  • Check your working directory inside R Workbench (“Project:”) and switch project if necessary

  • Check your R version

    • R 4.5.1 (use this!)
    • R 4.2.0

1. Intro to the Grammar of Graphics and ggplot2

What Is a Dataset?

Imagine you are a data analyst for Ford and have information on ~200 cars. For each car, you know:

  • How big the engine is (engine size or displacement, in liters)
  • How many miles it gets on the highway (fuel efficiency, in miles per gallon)
  • The class of the car (e.g., compact, SUV, etc.)

Questions for you:

  • What do you think a dataset like this might look like?
  • What is a dataset, anyway?
  • What does one row represent? What does one column represent?

Plotting Data

Imagine you pick two pieces of information (i.e., variables) from this car dataset:

  • variable 1: car fuel efficiency on the highway
  • variable 2: car engine size (engine displacement)

You want to understand the relationship between these two variables visually, so you start by creating a simple scatter plot…

Our First Plot with ggplot2

Before we explain the code to make this plot…

  • What’s on the x-axis? And on the y-axis?
  • What do you think each dot in this scatterplot represents?
  • If you wanted to pick a car that combines high fuel efficiency on the highway and a large engine size, where would you look on this plot? Are there cars like that here?
  • Larger engines tend to have __________ highway MPG efficiency

Understanding a Dataset: mpg

Open R Workbench and use R version 4.5.1

Copy/paste this code into your R Console and run it for a preview of the mpg dataset:

library(ggplot2)
data(mpg)
head(mpg)

Installing and Loading Packages in R

To use external tools like ggplot2, you need to first install and then load the package. A package is a collection of code, data, and documentation.

Install a package. Only once per computer:

install.packages("ggplot2")

Load a package. Every time you use it, at the top of the script:

library(ggplot2)

In this course, we’re using RStudio Workbench, where everything is already installed. So you only need to load packages.
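
If you later run your scripts outside Workbench, one common pattern is to install a package only when it is missing. This is a minimal sketch, assuming the computer can reach a CRAN mirror. requireNamespace() is a base R function that returns TRUE when a package is already installed.

```r
# Install ggplot2 only if it is not already installed on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load it for this session (still needed every time you start R)
library(ggplot2)
```

The install step runs at most once per computer, while library() runs every session, which matches the two steps described above.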

Key Dataset Terms

  • Observation = each row = one car
  • Variable = each column = a car property (e.g., displ, hwy, class)
    • displ: engine size or displacement, in liters
    • hwy: fuel efficiency, highway miles per gallon
    • class: class or type of car (compact, SUV…)

The mpg dataset has 234 observations (cars, in rows) and 11 variables (properties, in columns)
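
You can check these numbers yourself. The sketch below uses only base R functions (nrow(), ncol(), names()) on the mpg data that ships with ggplot2. The object name first_car is just an illustrative choice.

```r
library(ggplot2)

# Number of rows (observations) and columns (variables)
nrow(mpg)
ncol(mpg)

# Names of all the variables
names(mpg)

# First observation, showing only the three variables described above
first_car <- mpg[1, c("displ", "hwy", "class")]
first_car
```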

Code to Make our First Plot with ggplot2

library(ggplot2)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

What this code does:

  • Sets the data: mpg
  • Makes a scatterplot: geom_point()
  • Tells what to put on the axes: aes()
  • Maps the variable engine size to x-axis: x = displ
  • Maps the variable fuel efficiency to y-axis: y = hwy

Add Title and Axes Labels

Our initial plot also had a title and axis labels. Here is the full code to add them (copy/paste it into your Console):

library(ggplot2)

ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  labs(title = "Car Fuel Efficiency on the Highway vs. Car Engine Size",
       x = "Car Engine Size in Liters",
       y = "Car Highway MPG (Fuel Efficiency)")

Why Visualize Data?

Before you run numbers or models, it’s useful to look at the data visually!

  • Graphs help you spot patterns, relationships, and outliers
  • Visualization makes data more understandable and shareable
  • You can explore your questions more quickly

💻 Practice

Team up with someone, run this code in your Console, and answer the questions:

library(ggplot2)
data(mpg)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

  • What do you see? Describe the plot to your partner
  • What do the first two lines of code do?
  • What does color = class do?
  • Replace the current x variable with the variable cty (city miles per gallon), run the code, and describe the new plot

Grammar of Graphics and ggplot2

We’ve been creating plots intuitively. Now, let’s learn them formally.

  • ggplot2 is one of the most widely used R packages for data visualization
  • It’s part of the tidyverse, a collection of packages for data science: https://www.tidyverse.org/
  • Created by Hadley Wickham, who also co-authored your course textbook
  • Built using the theory called the Grammar of Graphics, a system for creating layered plots

Grammar and “Grammar of Graphics”

A grammar is a set of rules (syntax and morphology) that helps us structure a language. It lets us communicate clearly.

Applied to R and ggplot2…

A Grammar of Graphics is a set of rules for building data visualizations. It lets us create many types of plots using the same structure.

Main Components of the Grammar of Graphics

The Grammar of Graphics defines a plot as built from five main parts.

These five parts together are called a layer:

  • DATA: the dataset you are using
  • GEOM: the type of plot (e.g., points, bars, lines)
  • MAPPING: maps variables to aesthetics like x, y, color, etc. with aes()
  • STAT: whether the data should be transformed (e.g., counted) or not (identity)
  • POSITION: how things are arranged on the plot (e.g., stacked or jittered)

Grammar of Graphics: Code Template

Let’s look at how these five parts of a layer show up in ggplot2 code:

# code template

ggplot(data = <DATA>) +
  <GEOM>(
    mapping = aes(<MAPPING>),
    stat = <STAT>,
    position = <POSITION>
  )

You can also add more elements, like:

+ <COORDINATE SYSTEM>
+ <FACET>

Grammar of Graphics: Code Template with the mpg Dataset

Now, fill in this code template using the mpg data and plot it:

# code template

ggplot(data = <DATA>) +
  <GEOM>(
    mapping = aes(<MAPPING>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE> +
  <FACET>

# filled with mpg data

ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_wrap(vars(class), nrow = 1)

2. R Coding Style: Best Practices

On Coding Style in R

Before continuing with the Grammar of Graphics, let’s take a moment to focus on coding style (parentheses, operators, indentation, variable names, etc.).

Here’s the same code as before. This time, read it focusing on its style. What do you notice?

ggplot(data = mpg) + 
  geom_point(
     mapping = aes(x = displ, y = hwy, color = class),
     stat = "identity", 
     position = "identity") +
  coord_cartesian() +
  facet_wrap(facets = vars(class), nrow = 1)

Coding Style Checklist

  • Use <- to assign values to variables
  • Use = to assign values to function arguments
  • Use descriptive, meaningful variable names (snake_case, lowercase)
  • Add spaces around operators (e.g., +, -, <-) for readability
  • Avoid spaces inside parentheses or brackets
  • Close every opened parenthesis
  • Break long lines and use indentation
  • Use comments, but do not abuse them

Why Care About Coding Style?

Resources to guide you:

Make and Store a Plot

This code makes a plot, but does not store it:

ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_wrap(vars(class), nrow = 1)

This code makes a plot and stores it in an object named fuel_plot. Typing the object name on its own line displays the plot:

fuel_plot <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_wrap(vars(class), nrow = 1)
fuel_plot
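
Storing a plot is useful because you can reuse it later. The sketch below stores a simpler version of the plot, adds a title, and saves the result with ggsave(), a ggplot2 function. The file name fuel_plot.pdf and the temporary folder are assumptions for illustration. In practice you would pick a folder inside your project.

```r
library(ggplot2)

# Store a simpler version of the plot in an object
fuel_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Reuse the stored plot: add a title without retyping the plot code
fuel_plot_titled <- fuel_plot +
  labs(title = "Highway MPG vs. Engine Size")

# Save the plot to a file (hypothetical file name, in a temporary folder)
plot_file <- file.path(tempdir(), "fuel_plot.pdf")
ggsave(filename = plot_file, plot = fuel_plot_titled, width = 7, height = 4)
```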

3. More on the Grammar of Graphics and ggplot2

Core Components of ggplot2

You’ve seen the code template and the big picture. Now we’ll take a closer look at key components:

  • Layers
    • data
    • mapping
    • geom
    • stat
    • position
  • Coordinate System
  • Facets
  • Scales (not covered today)

Layers

Layers create the visual elements on a plot. Each layer has five parts:

ggplot(data = mpg) + 
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    stat = "identity", 
    position = "identity"
  ) + 
  geom_smooth(
    mapping = aes(x = displ, y = hwy),
    method = "lm"
  )

Data and Aesthetic Mappings

Data:

  • The dataset used for the plot
  • Example: data = mpg

Mapping:

  • Links variables to aesthetics (x, y, color, size…)
  • Example: aes(x = displ, y = hwy, color = class)

How ggplot2 Controls Positioning

Position adjusts how elements are arranged on the plot, for example to avoid overlap.

  • Scatterplot: often position = "identity"
  • Jitter: position = "jitter" spreads overlapping points
  • Bar plot: use "stack", "dodge", etc.

Tip

Check geom_*() documentation to learn supported positions.
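
To see what the position argument changes, compare the three plots below. This is a sketch. The drv variable (drive type) is part of mpg but has not been used in these slides so far, and the object names are illustrative.

```r
library(ggplot2)

# Identity: points with identical values sit exactly on top of each other
identity_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "identity")

# Jitter: each point gets a small random shift, so overlaps become visible
jitter_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# Dodge: bars for each drive type are placed side by side within a class
dodge_plot <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv), position = "dodge")

identity_plot
jitter_plot
dodge_plot
```

Because jitter is random, the jittered plot looks slightly different every time you run it.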

Statistical Transformations in Layers

Stat: transforms your data for plotting.

  • Bar charts: geom_bar() uses stat = "count" by default
  • Scatterplots: often use stat = "identity"

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = class)) # implicitly counts
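
One way to see the difference between the two stats is to count the cars yourself and then plot the counts. This is a sketch. class_counts and the plot names are illustrative, and Freq is the column name that base R's table() produces when converted to a data frame.

```r
library(ggplot2)

# stat = "count" (the default): geom_bar() counts the cars in each class
counted_plot <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# The same counts computed by hand; the count column is called Freq
class_counts <- as.data.frame(table(class = mpg$class))

# stat = "identity": bar heights are taken directly from the Freq column
precounted_plot <- ggplot(data = class_counts) +
  geom_bar(mapping = aes(x = class, y = Freq), stat = "identity")

counted_plot
precounted_plot
```

The two plots show the same bars. In the first, ggplot2 does the counting. In the second, the data is already counted and is plotted as-is.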

Choosing the Right Geometric Object

The geometric object is the type of plot you’re making:

  • geom_point() → scatterplot
  • geom_line() → line chart
  • geom_bar() → bar chart

Each geom accepts certain aesthetics. Read the docs: geom_point reference
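
Here is one small example per geom. The first two use mpg. The line chart is a sketch that assumes the economics dataset (monthly US economic data with date and unemploy columns), which also ships with ggplot2 but is not otherwise used in these slides.

```r
library(ggplot2)

# geom_point(): one dot per car
point_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# geom_bar(): one bar per class, bar height = number of cars
bar_plot <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# geom_line(): observations connected in order of x (here, over time)
line_plot <- ggplot(data = economics) +
  geom_line(mapping = aes(x = date, y = unemploy))

point_plot
bar_plot
line_plot
```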

💻 Practice: Color Aesthetics

Which of these three code blocks is correct? Run and compare them in R.


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

💻 Practice: Color Aesthetics (Answers)

✅ Correct

aes(color = class)

Maps a variable.

✅ Correct

color = "blue"

Sets color manually (outside aes).

❌ Incorrect

aes(color = "blue")

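Maps the fixed string “blue” as if it were a variable. The code runs, but ggplot2 treats “blue” as a category label: the points get the default color for the first category (not blue), and a legend appears. Not useful.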

Setting vs Mapping in ggplot2

  • Setting means assigning a fixed value outside aes()
    Example: color = "blue" (does not come from the data)

  • Mapping means linking a variable inside aes()
    Example: aes(color = class) (comes from the data, from a variable)
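
You can check the difference with layer_data(), a ggplot2 function that returns the values a layer actually draws. This is a sketch with illustrative object names. Note that ggplot2 stores the aesthetic under the British spelling colour.

```r
library(ggplot2)

# Setting: color is a fixed value outside aes(), so every point is blue
set_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: color comes from the class variable, one color per class
mapped_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Inspect the colors each plot uses
unique(layer_data(set_plot)$colour)     # a single value: "blue"
unique(layer_data(mapped_plot)$colour)  # one color code per class
```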

Common Beginner Mistakes

  • Putting fixed values like “blue” inside aes() (e.g., aes(color = "blue"))
  • Forgetting to close aes() or geom_*() parentheses
  • Using <- inside aes() instead of =

Coordinate System and Faceting

Coordinate System:

  • Translates data into the plot space
  • Default: Cartesian system (coord_cartesian())

Faceting:

  • Splits data into subplots by a variable

ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, color = class)) +
  coord_cartesian() +
  facet_wrap(vars(class), nrow = 1)
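
Both components can be customized. The sketch below changes one argument in each. The xlim range of 2 to 5 liters and the two-row layout are arbitrary choices for illustration.

```r
library(ggplot2)

# Shared starting point for both examples
base_plot <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

# Coordinate system: zoom in on engines between 2 and 5 liters
# (this changes the visible region only; no data is removed)
zoomed_plot <- base_plot +
  coord_cartesian(xlim = c(2, 5))

# Faceting: one panel per class, arranged in two rows instead of one
faceted_plot <- base_plot +
  facet_wrap(vars(class), nrow = 2)

zoomed_plot
faceted_plot
```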

4. Simplifying ggplot2 with Defaults

Example 1

Example from the previous slides

Long version:

ggplot(data = mpg) + 
  geom_point(
    mapping = aes(x = displ, y = hwy, color = class),
    stat = "identity", 
    position = "identity") + 
  geom_smooth(
    mapping = aes(x = displ, y = hwy),
    method = "lm")

Short version:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) + 
  geom_smooth(method = "lm")

Example 2

Scatterplot between cars’ engine size (displ) and highway fuel efficiency (hwy)

Long version:

ggplot() +
  layer(
    data = mpg, 
    mapping = aes(x = displ, y = hwy),
    geom = "point", 
    stat = "identity", 
    position = "identity"
  ) +
  scale_y_continuous() +
  scale_x_continuous() +
  coord_cartesian()

Short version:

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

Defaults Cheat Sheet

Use these defaults to simplify code:

  • stat = "identity" for geom_point()
  • position = "identity" unless changed
  • Common aes() calls can go in ggplot()
  • scale_*_continuous() not needed unless customizing
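
If you want to convince yourself that the defaults give the same result, you can compare the values each version draws. This is a sketch using layer_data() from ggplot2, with illustrative object names.

```r
library(ggplot2)

# Long version: every default written out
long_plot <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",
    position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()

# Short version: relies on the defaults
short_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Compare the x and y positions that each plot draws
long_data <- layer_data(long_plot)
short_data <- layer_data(short_plot)
all.equal(long_data$x, short_data$x)
all.equal(long_data$y, short_data$y)
```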

Quick Review

  1. What does geom_point() do?
  2. What does stat = "identity" mean?
  3. When would you use facet_wrap()?
  4. What’s wrong with aes(color = "blue")?

Try answering these without peeking at the slides!

5. Practice: Gapminder Data

💻 Practice: Gapminder

Download today’s in-class materials from the website!

The gapminder dataset:

Contains data on socio-economic indicators for countries around the world over multiple years (1952-2007, in five-year increments). It includes information on life expectancy, GDP per capita, and population.

Gapminder info: https://cran.r-project.org/web/packages/gapminder/readme/README.html and https://www.gapminder.org/
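
A possible way to get started, assuming the gapminder package is installed on your system. The column names (year, gdpPercap, lifeExp, continent) come from that package. This is one starting point, not the solution to the in-class exercises.

```r
library(ggplot2)
library(gapminder)

# Preview the data: one row per country per year
head(gapminder)

# Years covered by the data
range(gapminder$year)

# A first scatterplot: GDP per capita vs. life expectancy, by continent
gdp_life_plot <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent))

gdp_life_plot
```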

Recap: What We Learned Today

  • Core components of the Grammar of Graphics
  • Generate simple graphs with ggplot2
  • Use the Grammar of Graphics template and simplify it with defaults
  • R Coding Style Best Practices

Reminders

  • Homework 1 is due Wednesday, July 2nd
  • There is no class Thursday, July 3rd

To print these slides as pdf

Click the icon in the bottom-right corner > Tools > PDF Export Mode > Print as a PDF