This handout accompanies the workshop given on August 21, 2019 at UGA’s DigiLab in the Main Library. There is substantial overlap with previous workshops on ggplot2, though the use of the Stranger Things dataset is new, as are the bar plots with the girl names data (thanks to John Hale for the last-minute recommendation!). Please visit joeystanley.com/r for the latest materials.
This is the first of three workshops in the Data Visualization series devoted to ggplot2. This workshop will cover how to make some basic plots. The next one will explore more of the ggplot2 syntax and show how to modify aspects of your plot, like the colors, and how to reorder things. After that, we’ll dive into more advanced topics and look at how to change the overall “theme” of your plot, including how to add custom themes to match your PowerPoint slides.
The goal for this series is not to cover every aspect of ggplot2. Instead, I hope to expose you to some basic code with the hopes that you leave being able to apply this code to your own data.
To get the most out of this workshop, you should have some experience with R. I don’t expect you to be a pro, but I’m assuming you have been able to get your data into R, you’ve run some functions, and that you’re familiar with the basics.
ggplot2 does not come standard with R, so you’ll have to install it on your computer. Luckily, this is pretty straightforward and can be done just like any other R package.
install.packages("ggplot2") # If you haven't done so already.
You only need to do this once, unless you want to update the package. If you already have ggplot2 installed on your computer, it might be worth reinstalling it anyway: ggplot2 3.0 was released in July 2018, and it’s a good idea to have the latest version.
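If you aren’t sure whether your installed copy is recent enough, you can check before reinstalling. Here’s a minimal sketch; needs_update is a made-up helper name for this handout (not part of R or ggplot2), and the "3.0.0" threshold is simply the 3.0 release mentioned above.

```r
# Hypothetical helper: TRUE if a package is missing or older than `minimum`.
needs_update <- function(pkg, minimum) {
  # requireNamespace() returns FALSE (rather than erroring) if pkg is absent,
  # and || stops there, so packageVersion() only runs for installed packages.
  !requireNamespace(pkg, quietly = TRUE) ||
    packageVersion(pkg) < minimum
}

if (needs_update("ggplot2", "3.0.0")) {
  message("Consider running install.packages(\"ggplot2\")")
} else {
  message("ggplot2 ", as.character(packageVersion("ggplot2")),
          " is already installed")
}
```

Nothing gets installed by this check; it only tells you whether running install.packages() is worth your time.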
What you do need to do every time you start R is load the package using the library() function. Go ahead and do that now.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
There are two datasets that we’ll be working with in this workshop.
The first is a spreadsheet of basic information about each episode of Stranger Things. The information is available from IMDB, where you can get data for over 5,000 movies and TV shows. I’ve gone ahead and done some simple prep already (like isolating just the Stranger Things episodes) so it should be easy to work with. You can read this file directly from my website into R like this:
stranger <- read.csv("http://joeystanley.com/data/stranger.csv")
Let’s inspect this dataset just so we have a better idea of what it looks like.
View(stranger)
summary(stranger)
##                  title    season      episode         rating
##  Dig Dug            : 1   Min.   :1   Min.   :1.00   Min.   :6.1
##  E Pluribus Unum    : 1   1st Qu.:1   1st Qu.:3.00   1st Qu.:8.5
##  Holly, Jolly       : 1   Median :2   Median :5.00   Median :8.8
##  MADMAX             : 1   Mean   :2   Mean   :4.68   Mean   :8.7
##  Suzie, Do You Copy?: 1   3rd Qu.:3   3rd Qu.:7.00   3rd Qu.:9.0
##  The Bathtub        : 1   Max.   :3   Max.   :9.00   Max.   :9.4
##  (Other)            :19
##      votes          minutes
##  Min.   :10309   Min.   :41.00
##  1st Qu.:11693   1st Qu.:48.00
##  Median :13148   Median :51.00
##  Mean   :13485   Mean   :52.08
##  3rd Qu.:14909   3rd Qu.:55.00
##  Max.   :19185   Max.   :77.00
##
So we have the title of the episode, the season number, the episode number, its average rating on IMDB, the number of votes that produced that rating, and how long the episode was in minutes.
As one small bit of prep, right now the season number is being treated as a number, when we really want it as a factor. Let’s change that really quickly:
stranger$season <- factor(stranger$season)
summary(stranger)
##                  title    season    episode         rating
##  Dig Dug            : 1   1:8    Min.   :1.00   Min.   :6.1
##  E Pluribus Unum    : 1   2:9    1st Qu.:3.00   1st Qu.:8.5
##  Holly, Jolly       : 1   3:8    Median :5.00   Median :8.8
##  MADMAX             : 1          Mean   :4.68   Mean   :8.7
##  Suzie, Do You Copy?: 1          3rd Qu.:7.00   3rd Qu.:9.0
##  The Bathtub        : 1          Max.   :9.00   Max.   :9.4
##  (Other)            :19
##      votes          minutes
##  Min.   :10309   Min.   :41.00
##  1st Qu.:11693   1st Qu.:48.00
##  Median :13148   Median :51.00
##  Mean   :13485   Mean   :52.08
##  3rd Qu.:14909   3rd Qu.:55.00
##  Max.   :19185   Max.   :77.00
##
There we go.
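To see why the conversion matters, here’s a small sketch on a made-up column of season numbers (the toy values are mine, not taken from the Stranger Things file). As a number, summary() reports quartiles and a mean; as a factor, it counts how many rows fall into each category, which is what we want for something like a season.

```r
# Toy data: made-up season numbers, for illustration only
toy <- data.frame(season = c(1, 1, 2, 3, 3, 3))

summary(toy$season)   # numeric: min, quartiles, mean, max

# Convert the column to a factor, just like we did for stranger$season
toy$season <- factor(toy$season)

summary(toy$season)   # factor: one count per level
levels(toy$season)    # the distinct categories, stored as text labels
```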
The second dataset we’ll use in this workshop is a small one that lists the 25 most common baby girl names in the US in 2017. The data comes from the US Social Security Administration, and I created it with the help of the babynames package. It’s also stored on my website, but this time, it’s a tab-delimited file.
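I made the filetype a little different on purpose to quickly show how to read other file formats in. Here, instead of read.csv, I’ll use read.table with header = TRUE so that the first row is treated as the column names. By default, read.table splits columns on any whitespace, which includes tabs. If you want to be explicit, you can add sep = "\t", since "\t" is how a tab is represented in R.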
girlnames <- read.table("http://joeystanley.com/data/girlnames.txt", header = TRUE)
girlnames
## name n
## 1 Emma 19738
## 2 Olivia 18632
## 3 Ava 15902
## 4 Isabella 15100
## 5 Sophia 14831
## 6 Mia 13437
## 7 Charlotte 12893
## 8 Amelia 11800
## 9 Evelyn 10675
## 10 Abigail 10551
## 11 Harper 10451
## 12 Emily 9746
## 13 Elizabeth 8915
## 14 Avery 8186
## 15 Sofia 8134
## 16 Ella 8014
## 17 Madison 7847
## 18 Scarlett 7679
## 19 Victoria 7267
## 20 Aria 7132
## 21 Grace 6991
## 22 Chloe 6912
## 23 Camila 6752
## 24 Penelope 6639
## 25 Riley 6343
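Base R has more than one way to read a tab-delimited file. As a quick sketch, the code below reads the same tab-separated text three ways: read.table relying on its default whitespace splitting, read.table with an explicit sep = "\t", and read.delim, whose default separator is already a tab. To keep it runnable offline, I’ve typed the first three rows of girlnames into a string and passed it with the text argument instead of a URL; tab_text and the by_* names are just placeholders.

```r
# First three rows of girlnames, typed by hand with tabs ("\t") between columns
tab_text <- "name\tn\nEmma\t19738\nOlivia\t18632\nAva\t15902"

# 1. Default separator: any whitespace, which includes tabs
by_whitespace <- read.table(text = tab_text, header = TRUE)

# 2. Explicit tab separator (safer if a column could contain spaces)
by_tab <- read.table(text = tab_text, header = TRUE, sep = "\t")

# 3. read.delim assumes a header row and a tab separator by default
by_delim <- read.delim(text = tab_text)

by_delim
```

For a file like this one, where no value contains a space, all three should give you the same two-column data frame. The explicit tab versions are the safer habit if your own data has spaces inside a column.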
When you use ggplot2 to make visualizations of your own data, you’ll have to load it in and make sure it’s clean and tidy just like the sample datasets are. I won’t go over how to tidy your data, but a key part of creating good visualizations is good data.
Okay, finally, we’re ready to plot! The main function in ggplot2 is the ggplot function. In fact, you can call this function without any arguments, and it’s still valid R code.
ggplot()
When you run this line of code, in the bottom right panel of RStudio, the Plots tab is selected and this new visual appears. It doesn’t do much other than produce a gray rectangle though. In fact, it’s a coordinate system without any axes. This is the base layer that everything else gets added on top of. But it’s important to see what your blank canvas is, so to speak.
The first argument in the ggplot function is the data argument. To make a visualization of a particular dataset, just add data = plus the name of your dataset. We’ll use the stranger dataset that you should have read in earlier.
ggplot(data = stranger)
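Because ggplot() returns an object, you can also store it in a variable and poke at it before anything is drawn on top. The sketch below uses a hand-typed stand-in data frame (the first three rows of girlnames) so that it runs without downloading anything; mini and p are just names I picked.

```r
library(ggplot2)

# Stand-in data: the first three rows of girlnames, typed by hand
mini <- data.frame(name = c("Emma", "Olivia", "Ava"),
                   n    = c(19738, 18632, 15902))

# Attach the data to a plot object, but don't add any layers yet
p <- ggplot(data = mini)

inherits(p, "ggplot")   # is this a ggplot object?
length(p$layers)        # how many layers have been added so far?

# Printing the object is what actually draws it
p
```

With no layers added yet, printing p should show just the empty canvas. The data is attached, but we haven’t told ggplot2 what to do with it.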