This handout accompanies the workshop given on August 21, 2019 at UGA’s DigiLab in the Main Library. There is substantial overlap with previous workshops on ggplot2, though the use of the Stranger Things dataset is new, as are the bar plots with the girl names data (thanks John Hale for the last-minute recommendation!). Please visit joeystanley.com/r for the latest materials.


This is the first of three workshops in the Data Visualization series devoted to ggplot2. This workshop will cover how to do some basic plots. The next one will explore more of the ggplot2 syntax and see how to modify aspects of your plot like the colors and how to reorder things. After that, we’ll dive into more advanced topics, and look at how to change the overall “theme” of your plot, including how to add custom themes to match your PowerPoint slides.

The goal for this series is not to cover every aspect of ggplot2. Instead, I hope to expose you to some basic code with the hopes that you leave being able to apply this code to your own data.

To get the most out of this workshop, it is expected that you have some experience with R. I don’t expect you to be a pro, but I’m assuming you have been able to get your data into R, you’ve run some functions, and that you’re familiar with the basics.

1 The basics

1.1 Downloading and Installation

ggplot2 does not come standard with R, so you’ll have to install it to your computer. Luckily, this is pretty straightforward and can be done just like any other R package.

install.packages("ggplot2") # If you haven't done so already.

You only need to do this once, unless you want to update the package. If you have ggplot2 already installed on your computer, it might be worth reinstalling it anyway because ggplot2 3.0 was released in July 2018 and it’s a good idea to get the latest version.
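
If you aren’t sure which version you have, one way to check is with packageVersion(), which comes with base R. The sketch below only reports what it finds. The 3.0.0 threshold is just the release mentioned above, not a hard requirement, and the variable names are made up for illustration.

```r
# Check whether ggplot2 is installed and, if so, which version.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  installed_version <- packageVersion("ggplot2")
  message("ggplot2 version: ", as.character(installed_version))
  # Package versions can be compared against a version string.
  is_recent <- installed_version >= "3.0.0"
} else {
  message("ggplot2 is not installed yet; run install.packages(\"ggplot2\").")
  is_recent <- FALSE
}
is_recent
```

If is_recent comes back FALSE, running install.packages("ggplot2") again will fetch the current release.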

What you do need to do every time you run R is to load the package using the library() function. Go ahead and do that now.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2

1.2 Data for this workshop

There are three datasets that we’ll be working with in this workshop.

1.2.1 Stranger Things

The first is a spreadsheet of basic information about each episode of Stranger Things. The information is available from IMDB, where you can get data for over 5,000 movies and TV shows. I’ve gone ahead and done some simple prep already (like isolating just the Stranger Things episodes) so it should be easy to work with. You can read in this file directly from my website into R like this:

stranger <- read.csv("http://joeystanley.com/data/stranger.csv")

Let’s inspect this dataset just so we have a better idea of what it looks like.

View(stranger)
summary(stranger)
##                  title        season     episode         rating   
##  Dig Dug            : 1   Min.   :1   Min.   :1.00   Min.   :6.1  
##  E Pluribus Unum    : 1   1st Qu.:1   1st Qu.:3.00   1st Qu.:8.5  
##  Holly, Jolly       : 1   Median :2   Median :5.00   Median :8.8  
##  MADMAX             : 1   Mean   :2   Mean   :4.68   Mean   :8.7  
##  Suzie, Do You Copy?: 1   3rd Qu.:3   3rd Qu.:7.00   3rd Qu.:9.0  
##  The Bathtub        : 1   Max.   :3   Max.   :9.00   Max.   :9.4  
##  (Other)            :19                                           
##      votes          minutes     
##  Min.   :10309   Min.   :41.00  
##  1st Qu.:11693   1st Qu.:48.00  
##  Median :13148   Median :51.00  
##  Mean   :13485   Mean   :52.08  
##  3rd Qu.:14909   3rd Qu.:55.00  
##  Max.   :19185   Max.   :77.00  
## 

So we have the title of the episode, the season number, the episode number, its average rating on IMDB, the number of votes to produce that rating, and how long the episode was in minutes.
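
One thing to be aware of: the output above was produced with R 3.5, where read.csv() turned text columns into factors automatically. Since R 4.0, text columns stay as plain character vectors by default, so your summary of the title column may look different. Here is a minimal sketch of the difference, using a tiny made-up table typed inline rather than the real file (the titles and ratings are invented for illustration):

```r
# A tiny stand-in for a CSV file, typed inline so nothing is downloaded.
csv_text <- "title,season,rating
Episode A,1,8.5
Episode B,1,9.0
Episode C,2,8.8"

# Default behavior in R 4.0 and later: title is read as a character column.
as_character <- read.csv(text = csv_text)
class(as_character$title)

# Asking for factors explicitly matches the older behavior.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factor$title)
summary(as_factor)
```

So if your summary doesn’t list individual titles, adding stringsAsFactors = TRUE to the read.csv() call is one way to get output closer to what is printed in this handout.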

As one small bit of prep, right now the season number is being treated as a number, when we really want it as a factor. Let’s change that really quickly:

stranger$season <- factor(stranger$season)
summary(stranger)
##                  title    season    episode         rating   
##  Dig Dug            : 1   1:8    Min.   :1.00   Min.   :6.1  
##  E Pluribus Unum    : 1   2:9    1st Qu.:3.00   1st Qu.:8.5  
##  Holly, Jolly       : 1   3:8    Median :5.00   Median :8.8  
##  MADMAX             : 1          Mean   :4.68   Mean   :8.7  
##  Suzie, Do You Copy?: 1          3rd Qu.:7.00   3rd Qu.:9.0  
##  The Bathtub        : 1          Max.   :9.00   Max.   :9.4  
##  (Other)            :19                                      
##      votes          minutes     
##  Min.   :10309   Min.   :41.00  
##  1st Qu.:11693   1st Qu.:48.00  
##  Median :13148   Median :51.00  
##  Mean   :13485   Mean   :52.08  
##  3rd Qu.:14909   3rd Qu.:55.00  
##  Max.   :19185   Max.   :77.00  
## 

There we go.
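
To see why this matters, here is a small sketch with a made-up vector of season numbers (not the real data). As numbers, R is happy to average them. As a factor, R treats each season as a category and counts how many times it appears.

```r
# Made-up season numbers for six hypothetical episodes.
toy_season <- c(1, 1, 2, 2, 2, 3)

mean(toy_season)                  # as numbers: an average, which isn't meaningful here

toy_factor <- factor(toy_season)  # as a factor: three categories
levels(toy_factor)                # the categories, stored as text labels
table(toy_factor)                 # how many episodes fall in each season

# If you ever need the original numbers back, go through character first.
as.numeric(as.character(toy_factor))
```

This distinction will come up again when plotting, since ggplot2 handles numeric and categorical variables differently.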

1.2.2 McDonald’s menu items

The second dataset that we’ll be working with is a spreadsheet of McDonald’s menu items. This file contains some nutritional information such as calories, fat, and sugars, as well as the item name and category. It is available for free at Kaggle.com, where you can get complete nutritional information. I’ve got a subset of this data on my website, so you can just read in this file directly from there into R like this:

menu <- read.csv("http://joeystanley.com/data/menu.csv")

Let’s inspect this dataset just so we have a better idea of what it looks like.

View(menu)
summary(menu)
##                Category                                       Item    
##  Coffee & Tea      :95   1% Low Fat Milk Jug                    :  1  
##  Breakfast         :42   Apple Slices                           :  1  
##  Smoothies & Shakes:28   Bacon Buffalo Ranch McChicken          :  1  
##  Beverages         :27   Bacon Cheddar McChicken                :  1  
##  Chicken & Fish    :27   Bacon Clubhouse Burger                 :  1  
##  Beef & Pork       :15   Bacon Clubhouse Crispy Chicken Sandwich:  1  
##  (Other)           :26   (Other)                                :254  
##        Oz            Calories           Fat              Sugars      
##  Min.   : 1.000   Min.   :   0.0   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.: 6.775   1st Qu.: 210.0   1st Qu.:  2.375   1st Qu.:  5.75  
##  Median :12.000   Median : 340.0   Median : 11.000   Median : 17.50  
##  Mean   :12.803   Mean   : 368.3   Mean   : 14.165   Mean   : 29.42  
##  3rd Qu.:16.000   3rd Qu.: 500.0   3rd Qu.: 22.250   3rd Qu.: 48.00  
##  Max.   :32.000   Max.   :1880.0   Max.   :118.000   Max.   :128.00  
## 
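
View() opens a spreadsheet-style viewer, which works best inside RStudio. If you’re working somewhere else, or just want a quick look in the console, a few base R functions give you similar information. This sketch uses a small made-up data frame with the same kinds of columns as menu; the items and values are invented for illustration.

```r
# A made-up stand-in for the menu data.
toy_menu <- data.frame(
  Category = c("Breakfast", "Beverages", "Breakfast"),
  Item     = c("Item A", "Item B", "Item C"),
  Calories = c(300, 0, 450),
  stringsAsFactors = TRUE
)

dim(toy_menu)      # number of rows, then number of columns
names(toy_menu)    # column names
head(toy_menu, 2)  # the first two rows
str(toy_menu)      # the type of each column
```

The same calls work on stranger and menu once you have read them in.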

1.2.3 Top 25 Girl Names in 2017

The last dataset we’ll use in this workshop is a small one that lists the top 25 most common baby girl names in the US in 2017. The data comes from the US Social Security Administration, and I created it with the help of the babynames package. It’s also stored on my website, but this time, it’s a tab-delimited file.

I made the filetype a little different on purpose to quickly show how to read other file formats in. Here, instead of read.csv, I’ll use read.table with header = TRUE so that the first row is treated as column names. By default, read.table separates columns on any whitespace, which includes tabs; a tab is represented as "\t" in R.

girlnames <- read.table("http://joeystanley.com/data/girlnames.txt", header = TRUE)
girlnames
##         name     n
## 1       Emma 19738
## 2     Olivia 18632
## 3        Ava 15902
## 4   Isabella 15100
## 5     Sophia 14831
## 6        Mia 13437
## 7  Charlotte 12893
## 8     Amelia 11800
## 9     Evelyn 10675
## 10   Abigail 10551
## 11    Harper 10451
## 12     Emily  9746
## 13 Elizabeth  8915
## 14     Avery  8186
## 15     Sofia  8134
## 16      Ella  8014
## 17   Madison  7847
## 18  Scarlett  7679
## 19  Victoria  7267
## 20      Aria  7132
## 21     Grace  6991
## 22     Chloe  6912
## 23    Camila  6752
## 24  Penelope  6639
## 25     Riley  6343
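
If your own tab-delimited file has spaces inside its values, splitting on any whitespace will break the columns, so it can be safer to name the separator explicitly with sep = "\t", or to use read.delim(), which assumes tabs and a header row by default. A minimal sketch, using the first three rows of the table above typed inline instead of downloaded (the object names are made up for illustration):

```r
# Three rows of tab-delimited text; "\t" is a tab and "\n" is a line break.
tsv_text <- "name\tn\nEmma\t19738\nOlivia\t18632\nAva\t15902"

# read.table with the separator spelled out.
from_table <- read.table(text = tsv_text, header = TRUE, sep = "\t")

# read.delim uses sep = "\t" and header = TRUE by default.
from_delim <- read.delim(text = tsv_text)

from_table
from_delim
```

For a real file, you would replace the text argument with the path or URL as the first argument.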

When you use ggplot2 to make visualizations of your own data, you’ll have to load it in and make sure it’s clean and tidy just like the sample datasets are. I won’t go over how to tidy your data, but a key part of creating good visualizations is good data.

1.3 Blank plots

Okay, finally, we’re ready to plot! The main function in ggplot2 is the ggplot function. In fact, you can call this function without any arguments, and it’s still valid R code.

ggplot()

When you run this line of code, in the bottom right panel of RStudio, the Plots tab is selected and this new visual appears. It doesn’t do much other than produce a gray rectangle though. In fact, it’s a coordinate system without any axes. This is the base layer that everything else gets added on top of. But it’s important to see what your blank canvas is, so to speak.

The first argument in the ggplot function is the data argument. To make a visualization of a particular dataset, just add data = plus the name of your dataset. We’ll use the stranger dataset that you should have downloaded earlier.

ggplot(data = stranger)
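
One thing that may help here: ggplot() returns an object, so you can store the plot in a variable and print it later. Below is a small sketch with a made-up data frame (the name toy and its values are invented for illustration). The plot is still blank because the data has been attached but no layers have been added yet.

```r
library(ggplot2)

# A made-up data frame standing in for a real dataset.
toy <- data.frame(episode = 1:3, rating = c(8.5, 9.0, 8.8))

p <- ggplot(data = toy)  # nothing is drawn yet; the plot is stored in p
length(p$layers)         # number of layers added so far
p                        # printing p draws the (still blank) plot
```

Storing the plot this way is optional, but it can be handy once you start building plots up piece by piece.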