Introduction to R: Lecture 5

Topics: Data Analysis and Graphs

Sabrina Nardin, Summer 2025

Agenda

The scorecard Dataset
Match Graph Types to Variable Types
Practice

These slides were last updated on July 24, 2025

Our Goals Today

Practice using graphs for data analysis. Specifically:

Display variation and co-variation: Learn how to visualize the distribution of a single variable (e.g., cost to attend a school) and of two or more variables (e.g., cost and admrate).
Match graph type to variable type: Choose appropriate graph types depending on whether variables are categorical (e.g., school type), continuous (e.g., cost), or both.
Interpret the graph: Practice describing what a graph reveals — such as trends, group differences, and outliers.

1. The scorecard Dataset

About the scorecard dataset

The U.S. Department of Education collects annual statistics on colleges and universities in the United States: https://collegescorecard.ed.gov/data

This dataset includes variables such as:

name: name of the school
state: state where the school is located
type: school type (e.g., Public, Private Nonprofit, Private For-Profit)
admrate: admission rate (e.g., 0.91 = 91%)
cost: published cost of attendance
netcost: net cost of attendance after financial aid
satavg: average SAT score of admitted students
...: many additional variables

We’ll focus on a subset of this data from the 2018–2019 academic year.

About the scorecard dataset

library(tidyverse)
library(rcis)
data(scorecard)
glimpse(scorecard)

Rows: 1,732
Columns: 14
$ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
$ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
$ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
$ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
$ admrate   <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
$ satavg    <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
$ cost      <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
$ netcost   <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
$ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
$ pctpell   <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
$ comprate  <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
$ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
$ debt      <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
$ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…

The scorecard dataset: making a plot

Which type of college has the highest average SAT score?

# with faceted histogram
ggplot(data = scorecard, mapping = aes(x = satavg)) +
  geom_histogram() + 
  facet_wrap(facets = vars(type))

What’s a histogram?

Which type of college has the highest average SAT score?

# with a boxplot
ggplot(data = scorecard, mapping = aes(x = type, y = satavg)) +
  geom_boxplot()

What’s a boxplot?

The scorecard dataset: interpreting a plot

What do these graphs reveal about average SAT scores by type of college?

Interpreting plots is as important as writing code

According to these graphs, private nonprofit schools have the highest average SAT scores, closely followed by public schools, and then private for-profit schools.

But this interpretation doesn’t tell the full story:

From the histogram, we can see that each school type includes a different number of colleges. In other words, the groups have different sample sizes: there are far fewer private for-profit schools than schools of the other two types.
This matters because averages based on small groups may not represent the broader category well. Looking at the full distribution gives us a more complete understanding.
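One way to check this point in code is to put each group's average next to the number of schools behind it. The sketch below assumes the rcis package is installed, as in the earlier slides; the names sat_by_type, n_schools, n_with_sat, and mean_sat are made up for this illustration.

```r
library(tidyverse)
library(rcis)
data(scorecard)

# For each school type, report the group size alongside the average SAT score
sat_by_type <- scorecard %>%
  group_by(type) %>%
  summarize(
    n_schools = n(),                        # all schools of this type
    n_with_sat = sum(!is.na(satavg)),       # schools that actually report satavg
    mean_sat = mean(satavg, na.rm = TRUE)   # average, ignoring missing values
  )

sat_by_type
```

A mean computed from only a handful of schools deserves more caution than one computed from hundreds, so it helps to read mean_sat together with n_with_sat.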

The scorecard dataset: asking more questions, answering them with dplyr

How many schools are in each type?

scorecard %>% count(type)

Which schools are categorized as private for-profit?

scorecard %>%
  filter(type == "Private, for-profit") %>%
  select(name, state, type, satavg, cost)

What about the University of Chicago?

scorecard %>% filter(name == "University of Chicago")

Etc.

Once you draw a plot, ask yourself:

Substantive questions:

What does this graph tell me?
Are there patterns? Outliers?
What hypotheses can I generate?
What else do I want to know?
Should I dig deeper with dplyr?
Etc.

Stylistic questions:

Is the chosen plot appropriate (does it match the variable types)?
Is the plot clear and easy to understand?
Is it too busy or too simple?
Could it be improved? (titles, labels, colors, etc.)
Do I need to manipulate the data first?
Etc.

2. Matching Graph Types to Variable Types

Types of Visualizations and Best Graph Types

Do I want to represent variation in:

A single variable?
Two variables?
Three variables?

What type(s) are my variables?

Continuous (e.g., satavg: average SAT score; cost: published cost)
Categorical (e.g., type: school type; state: U.S. state)
Other types (often treated as categorical): ordinal, nominal, binary

Tip

Think about which variables you want to display and their type before choosing a graph!
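If you are unsure how a variable is stored, you can ask R directly. This is a minimal sketch that assumes the rcis package is installed, as in the earlier slides; it only uses base R checks on columns shown in the glimpse() output.

```r
library(tidyverse)
library(rcis)
data(scorecard)

# Continuous variables are stored as numeric (<dbl> in glimpse())
is.numeric(scorecard$satavg)
is.numeric(scorecard$cost)

# Categorical variables are stored as factor (<fct>) or character (<chr>)
is.factor(scorecard$type)
is.character(scorecard$state)

# List the categories of a factor
levels(scorecard$type)
```

Each is.*() call returns TRUE or FALSE, and levels() prints the category labels exactly as they must be typed in a filter().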

Univariate (One Variable)

To show how values vary within a single variable:

One continuous variable → Histogram
One categorical variable → Bar Chart

Function	What Gets Counted	When to Use
`geom_histogram()`	How many values fall into each numeric bin	When `x` is continuous
`geom_bar()`	How many observations are in each category	When `x` is categorical
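As a quick illustration of the two univariate graphs, here is a sketch that assumes the rcis package is installed, as in the earlier slides. The variables admrate and locale are chosen only as examples, and the object names p_hist and p_bar are made up for this illustration.

```r
library(tidyverse)
library(rcis)
data(scorecard)

# One continuous variable: histogram of admission rates
# bins = 30 is the default number of bins, written out explicitly
p_hist <- ggplot(data = scorecard, mapping = aes(x = admrate)) +
  geom_histogram(bins = 30)

# One categorical variable: bar chart counting schools in each locale
p_bar <- ggplot(data = scorecard, mapping = aes(x = locale)) +
  geom_bar()

p_hist
p_bar
```

Try changing bins to a smaller or larger number to see how the shape of the histogram changes.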

Bivariate (Two Variables)

To show how two variables co-vary:

Two continuous variables → Scatterplot
One categorical + one continuous → Box Plot
Two categorical variables → Grouped or Stacked Bar Chart
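The box plot case was shown earlier with satavg by type. The other two bivariate cases could look like the sketch below, which assumes the rcis package is installed as in the earlier slides. The variable pairs and the object names p_scatter, p_stacked, and p_grouped are picked only for illustration.

```r
library(tidyverse)
library(rcis)
data(scorecard)

# Two continuous variables: scatterplot of average SAT score vs. admission rate
p_scatter <- ggplot(data = scorecard, mapping = aes(x = satavg, y = admrate)) +
  geom_point()

# Two categorical variables: stacked bar chart (one bar per locale, split by type)
p_stacked <- ggplot(data = scorecard, mapping = aes(x = locale, fill = type)) +
  geom_bar()

# Same data as a grouped bar chart: bars placed side by side instead of stacked
p_grouped <- ggplot(data = scorecard, mapping = aes(x = locale, fill = type)) +
  geom_bar(position = "dodge")

p_scatter
p_stacked
p_grouped
```

Stacked bars make it easier to compare totals across locales, while grouped bars make it easier to compare school types within a locale.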

Multivariate (Three Variables)

To compare patterns across subgroups:

One categorical + two continuous → Faceted Scatterplot
Two categorical + one continuous → Grouped Box Plot

These are the most common combinations. There are more options as you explore deeper!
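A possible sketch of the two three-variable graphs listed above, assuming the rcis package is installed as in the earlier slides. The variable choices and the object names p_facet and p_grouped_box are for illustration only.

```r
library(tidyverse)
library(rcis)
data(scorecard)

# One categorical + two continuous: scatterplot with one panel per school type
p_facet <- ggplot(data = scorecard, mapping = aes(x = satavg, y = admrate)) +
  geom_point() +
  facet_wrap(facets = vars(type))

# Two categorical + one continuous: box plots by type, split by locale via fill
p_grouped_box <- ggplot(data = scorecard,
                        mapping = aes(x = type, y = satavg, fill = locale)) +
  geom_boxplot()

p_facet
p_grouped_box
```

In both cases the third variable is added without changing the basic graph type: facets split the plot into panels, and fill splits each group into colored subgroups.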

Main Graph Types and When to Use Them

Type of Graph	`ggplot2` Function	Input Variables	Goal
Histogram	geom_histogram()	One continuous (e.g., age, income)	Show the distribution of values
Bar Chart	geom_bar()	One categorical (e.g., region, gender)	Show frequencies or counts of categories
Stacked Bar Chart	geom_bar() with a fill aesthetic	Two categorical variables (e.g., region by gender)	Compare parts of a whole across categories
Scatterplot	geom_point()	Two continuous variables (e.g., height vs. weight, price vs. rating)	Show relationship or correlation
Box Plot	geom_boxplot()	One continuous + one categorical (e.g., income by gender)	Compare distributions, spot outliers
Faceted Scatterplot	facet_wrap() + geom_point()	Two continuous + one categorical (e.g., by country or year)	Compare patterns across groups

3. Practice

💻 Practice

On the next slide, you’ll see a set of tasks. In small groups, use the scorecard dataset to create the most appropriate graph for each one.

Before plotting: Consider the type of variable and the type of variation you need to represent. Use the slides as reference.
While plotting: Keep it simple, as you would for an initial Exploratory Data Analysis (e.g., no need to add labels, legends, color adjustments, scales, themes, facets, etc.)
After plotting: Stare at the graph… look for patterns, outliers, or any notable features, and substantively interpret the graph.

💻 Practice

Share your solutions here: https://codeshare.io/vAzK44

TASK 1: Plot the annual total cost of school attendance across the U.S. Hint: try geom_histogram() with the variable cost

TASK 2: Plot the total number of schools in the U.S. by school type. Hint: try geom_bar() with the variable type

TASK 3: Plot the annual total cost and net cost of attendance at schools in the U.S. (variables cost and netcost)

TASK 4: Plot the annual total cost of attendance by school type (variables cost and type)

TASK 5: Plot the annual total cost of attendance and net cost of attendance by school type (variables cost, netcost, type)

💻 Download Today’s Materials

Download today’s class materials from our website for:

further insights into these tasks (and solutions!)
additional practice exercises

Takeaways and Tips

Follow this approach to move from quick exploration to polished, communicative plots:

When exploring your data:

Start with simple plots to get a sense of the distributions and relationships
Focus on interpretation first, not style
Ask questions: What stands out? Are there outliers, trends, or surprises? What else do you want to know?

Once you settle on a plot:

Refine the code to improve readability and aesthetics
Add styling elements as needed (labels, scales, legends, themes, etc.)
Use polished plots for assignments, reports, or presentations

Typically, researchers create many exploratory plots and only a few make it to the final report.
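To make the two stages concrete, here is one way an exploratory box plot might be polished. This is a sketch that assumes the rcis package is installed, as in the earlier slides; the title and axis labels are example wording, and the object names p_explore and p_polished are made up for this illustration.

```r
library(tidyverse)
library(rcis)
data(scorecard)

# Exploratory version: quick, no styling
p_explore <- ggplot(data = scorecard, mapping = aes(x = type, y = satavg)) +
  geom_boxplot()

# Polished version: same plot, plus readable labels and a cleaner theme
p_polished <- p_explore +
  labs(
    title = "Average SAT score by school type",
    x = "School type",
    y = "Average SAT score"
  ) +
  theme_minimal()

p_explore
p_polished
```

Because ggplot2 builds plots in layers, the polished version reuses the exploratory plot and only adds to it.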

Recap: What We Learned Today

Pick graph types that match the types of variables in your data
Read and interpret visualizations
Practice using ggplot and dplyr

To print these slides as a PDF

Click on the icon in the bottom-right corner > Tools > PDF Export Mode > Print as PDF