This handout accompanies the workshop given on August 28, 2019 at UGA’s DigiLab in the Main Library. There is substantial overlap with my previous workshops on ggplot2. Please visit joeystanley.com/r for the latest materials.


1 Introduction

This is the second of three workshops in the Data Visualization series devoted to ggplot2. In the previous workshop, we looked at the basics of data visualization and data types and introduced the library ggplot2. Specifically, we looked at scatterplots and how we can plot shapes, color, size, and text. We looked at plotting one variable, whether it be categorical with a barplot or continuous with a histogram. Finally, we looked at boxplots and violin plots, and started to show how to overlay multiple plots in one image.

This workshop will focus less on the different kinds of plots and instead will show how you can modify things to suit your visual preferences. Specifically, we’ll take a closer look at modifying colors, reordering and renaming categorical variables, adding titles, modifying axes, and making changes to legends. It sounds like a lot, but it shouldn’t be too bad. After working through this handout, you’ll go from the basic default plots to something you might want to include in a presentation or paper.

Finally, the next workshop will dive into more advanced topics and look at how to change the overall “theme” of your plot, including how to create custom themes to match your PowerPoint slides.

Note: I wasn’t able to include everything I wanted to in this workshop. For the sake of simplicity and to ensure the workshop stayed under an hour, I have cut some of the depth on several of these topics and moved that discussion to the supplemental handout. There is no in-person workshop planned in the foreseeable future that covers these topics, but I wanted to make sure you had access to the material.

1.1 Today’s dataset

Last week we looked at McDonald’s menu items. Today, we’ll look at another dataset made available through Kaggle.com that contains nutritional information on about 80 different kinds of cereal. I’ve removed some of the columns to simplify the dataset and made it available on my website, so you can read it in straight from there.

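# Read the data from the web. stringsAsFactors = TRUE stores text columns as
# factors (the default before R 4.0), which the summary below relies on.
cereal <- read.csv("http://www.joeystanley.com/data/cereal.csv", stringsAsFactors = TRUE)
summary(cereal)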
##                         name    mfr    type       shelf      
##  100% Bran                : 1   G:22   C:74   Min.   :1.000  
##  100% Natural Bran        : 1   K:23   H: 1   1st Qu.:1.500  
##  All-Bran                 : 1   N: 6          Median :2.000  
##  All-Bran with Extra Fiber: 1   P: 9          Mean   :2.227  
##  Almond Delight           : 1   Q: 7          3rd Qu.:3.000  
##  Apple Cinnamon Cheerios  : 1   R: 8          Max.   :3.000  
##  (Other)                  :69                                
##      weight          cups            rating         calories    
##  Min.   :0.50   Min.   :0.2500   Min.   :18.04   Min.   : 50.0  
##  1st Qu.:1.00   1st Qu.:0.6700   1st Qu.:32.69   1st Qu.:100.0  
##  Median :1.00   Median :0.7500   Median :40.11   Median :110.0  
##  Mean   :1.03   Mean   :0.8207   Mean   :42.39   Mean   :107.1  
##  3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:50.28   3rd Qu.:110.0  
##  Max.   :1.50   Max.   :1.5000   Max.   :93.70   Max.   :160.0  
##                                                                 
##      sugars         protein           fat          sodium     
##  Min.   : 0.00   Min.   :1.000   Min.   :0.0   Min.   :  0.0  
##  1st Qu.: 3.00   1st Qu.:2.000   1st Qu.:0.0   1st Qu.:135.0  
##  Median : 7.00   Median :2.000   Median :1.0   Median :180.0  
##  Mean   : 7.08   Mean   :2.493   Mean   :1.0   Mean   :163.9  
##  3rd Qu.:11.00   3rd Qu.:3.000   3rd Qu.:1.5   3rd Qu.:215.0  
##  Max.   :15.00   Max.   :6.000   Max.   :5.0   Max.   :320.0  
##                                                               
##      fiber       
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 2.000  
##  Mean   : 2.173  
##  3rd Qu.: 3.000  
##  Max.   :14.000  
## 

As you can see, we have a relatively simple dataset with a few categorical variables and many continuous ones. We’ll use this for our plots today. To get us started, let’s see if the amount of sugar correlates with the cereal’s rating.

library(ggplot2)
ggplot(cereal, aes(sugars, rating)) + 
    geom_point()
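
If you would like a number to go with the scatterplot, one option is to compute the correlation directly and overlay a straight trend line. This is only a minimal sketch and was not part of the workshop itself. It assumes the CSV is still available at the URL used above and that the sugars and rating columns have no missing values (the summary above shows none). It uses cor() from base R and geom_smooth() from ggplot2.

```r
library(ggplot2)

# Re-read the data so this block runs on its own (same URL as above)
cereal <- read.csv("http://www.joeystanley.com/data/cereal.csv")

# Pearson correlation between sugar content and rating
cor(cereal$sugars, cereal$rating)

# Same scatterplot as before, with a fitted straight line on top
ggplot(cereal, aes(sugars, rating)) + 
    geom_point() + 
    geom_smooth(method = "lm")
```

A correlation close to -1 or 1 points to a strong linear relationship and a value near 0 to a weak one; the sign tells you the direction. The line drawn by geom_smooth(method = "lm") shows the same trend visually.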