This handout accompanies the workshop given on August 21, 2019 at UGA’s DigiLab in the Main Library. There is substantial overlap with previous workshops on ggplot2, though the use of the Stranger Things dataset is new, as are the bar plots with the girl names data (thanks, John Hale, for the last-minute recommendation!). Please visit joeystanley.com/r for the latest materials.

This is the first of three workshops in the Data Visualization series devoted to ggplot2. This workshop will cover how to do some basic plots. The next one will explore more of the ggplot2 syntax and see how to modify aspects of your plot like the colors and how to reorder things. After that, we’ll dive into more advanced topics, and look at how to change the overall “theme” of your plot, including how to add custom themes to match your PowerPoint slides.

The goal for this series is not to cover every aspect of ggplot2. Instead, I hope to expose you to some basic code with the hopes that you leave being able to apply this code to your own data.

To get the most out of this workshop, it is expected that you have some experience with R. I don’t expect you to be a pro, but I’m assuming you have been able to get your data into R, you’ve run some functions, and that you’re familiar with the basics.

1 The basics

1.1 Downloading and Installation

ggplot2 does not come standard with R, so you’ll have to install it to your computer. Luckily, this is pretty straightforward and can be done just like any other R package.

install.packages("ggplot2") # If you haven't done so already.

Again, you only need to do this once, unless you want to update the package. If you have ggplot2 already installed on your computer, it might be worth it to reinstall it anyway because ggplot2 3.0 was released in July 2018 and it’s a good idea to get the latest version.
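If you’re not sure which version you have, you can ask R directly. This is a small sketch using base R’s packageVersion(); it assumes ggplot2 is already installed, and the 3.0.0 threshold is just the release mentioned above, not a hard requirement.

```r
# Check which version of ggplot2 is installed (assumes it is installed already).
installed_version <- packageVersion("ggplot2")
installed_version

# TRUE if you have at least the 3.0.0 release mentioned above.
is_recent <- installed_version >= "3.0.0"
is_recent
```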

What you do need to do every time you run R is to load the package using the library() function. Go ahead and do that now.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

1.2 Data for this workshop

There are three datasets that we’ll be working with in this workshop.

1.2.1 Stranger Things

The first is a spreadsheet of basic information about each episode of Stranger Things. The information is available from IMDB, where you can get data for over 5,000 movies and TV shows. I’ve gone ahead and done some simple prep already (like isolating just the Stranger Things episodes) so it should be easy to work with. You can read in this file directly from my website into R like this:

stranger <- read.csv("http://joeystanley.com/data/stranger.csv")

Let’s inspect this dataset just so we have a better idea of what it looks like.

View(stranger)
summary(stranger)

##                  title        season     episode         rating   
##  Dig Dug            : 1   Min.   :1   Min.   :1.00   Min.   :6.1  
##  E Pluribus Unum    : 1   1st Qu.:1   1st Qu.:3.00   1st Qu.:8.5  
##  Holly, Jolly       : 1   Median :2   Median :5.00   Median :8.8  
##  MADMAX             : 1   Mean   :2   Mean   :4.68   Mean   :8.7  
##  Suzie, Do You Copy?: 1   3rd Qu.:3   3rd Qu.:7.00   3rd Qu.:9.0  
##  The Bathtub        : 1   Max.   :3   Max.   :9.00   Max.   :9.4  
##  (Other)            :19                                           
##      votes          minutes     
##  Min.   :10309   Min.   :41.00  
##  1st Qu.:11693   1st Qu.:48.00  
##  Median :13148   Median :51.00  
##  Mean   :13485   Mean   :52.08  
##  3rd Qu.:14909   3rd Qu.:55.00  
##  Max.   :19185   Max.   :77.00  
##

So we have the title of the episode, the season number, the episode number, its average rating on IMDB, the number of votes to produce that rating, and how long the episode was in minutes.

As one small bit of prep, right now the season number is being treated as a number, when we really want it as a factor. Let’s change that really quickly:

stranger$season <- factor(stranger$season)
summary(stranger)

##                  title    season    episode         rating   
##  Dig Dug            : 1   1:8    Min.   :1.00   Min.   :6.1  
##  E Pluribus Unum    : 1   2:9    1st Qu.:3.00   1st Qu.:8.5  
##  Holly, Jolly       : 1   3:8    Median :5.00   Median :8.8  
##  MADMAX             : 1          Mean   :4.68   Mean   :8.7  
##  Suzie, Do You Copy?: 1          3rd Qu.:7.00   3rd Qu.:9.0  
##  The Bathtub        : 1          Max.   :9.00   Max.   :9.4  
##  (Other)            :19                                      
##      votes          minutes     
##  Min.   :10309   Min.   :41.00  
##  1st Qu.:11693   1st Qu.:48.00  
##  Median :13148   Median :51.00  
##  Mean   :13485   Mean   :52.08  
##  3rd Qu.:14909   3rd Qu.:55.00  
##  Max.   :19185   Max.   :77.00  
##

There we go.
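If you want to see what that conversion does on its own, here is a minimal sketch with a tiny made-up data frame (the episodes object and its values are invented for illustration and are not the real stranger data).

```r
# A tiny made-up data frame (not the real stranger data).
episodes <- data.frame(season = c(1, 1, 2, 3),
                       rating = c(8.6, 8.9, 8.4, 9.0))

class(episodes$season)   # "numeric": R treats it as a quantity

episodes$season <- factor(episodes$season)

class(episodes$season)   # "factor": R now treats it as a set of categories
levels(episodes$season)  # "1" "2" "3"
```

This matters for plotting because ggplot2 treats numbers and factors differently, for example when choosing colors.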

1.2.2 McDonald’s menu items

The second dataset that we’ll be working with is a spreadsheet of McDonald’s menu items. This file contains some nutritional information such as calories, fat, and sugars, as well as the item name and category. It is available for free at Kaggle.com, where you can get complete nutritional information. I’ve got a subset of this data on my website, so you can just read in this file directly from there into R like this:

menu <- read.csv("http://joeystanley.com/data/menu.csv")

Let’s inspect this dataset just so we have a better idea of what it looks like.

View(menu)
summary(menu)

##                Category                                       Item    
##  Coffee & Tea      :95   1% Low Fat Milk Jug                    :  1  
##  Breakfast         :42   Apple Slices                           :  1  
##  Smoothies & Shakes:28   Bacon Buffalo Ranch McChicken          :  1  
##  Beverages         :27   Bacon Cheddar McChicken                :  1  
##  Chicken & Fish    :27   Bacon Clubhouse Burger                 :  1  
##  Beef & Pork       :15   Bacon Clubhouse Crispy Chicken Sandwich:  1  
##  (Other)           :26   (Other)                                :254  
##        Oz            Calories           Fat              Sugars      
##  Min.   : 1.000   Min.   :   0.0   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.: 6.775   1st Qu.: 210.0   1st Qu.:  2.375   1st Qu.:  5.75  
##  Median :12.000   Median : 340.0   Median : 11.000   Median : 17.50  
##  Mean   :12.803   Mean   : 368.3   Mean   : 14.165   Mean   : 29.42  
##  3rd Qu.:16.000   3rd Qu.: 500.0   3rd Qu.: 22.250   3rd Qu.: 48.00  
##  Max.   :32.000   Max.   :1880.0   Max.   :118.000   Max.   :128.00  
##

1.2.3 Top 25 Girl Names in 2017

The last dataset we’ll use in this workshop is a small one that lists the top 25 most common baby girl names in the US in 2017. The data comes from the US Social Security Administration, and I created the file with the help of the babynames package. It’s also stored on my website, but this time, it’s a tab-delimited file.

I made the filetype a little different on purpose to quickly show how to read other file formats in. Here, instead of read.csv, I’ll use read.table with header = TRUE. By default, read.table splits columns on any whitespace, which includes tabs. If you want to be explicit that the separator is a tab, you can add sep = "\t", which is how a tab is written in R.

girlnames <- read.table("http://joeystanley.com/data/girlnames.txt", header = TRUE)
girlnames

##         name     n
## 1       Emma 19738
## 2     Olivia 18632
## 3        Ava 15902
## 4   Isabella 15100
## 5     Sophia 14831
## 6        Mia 13437
## 7  Charlotte 12893
## 8     Amelia 11800
## 9     Evelyn 10675
## 10   Abigail 10551
## 11    Harper 10451
## 12     Emily  9746
## 13 Elizabeth  8915
## 14     Avery  8186
## 15     Sofia  8134
## 16      Ella  8014
## 17   Madison  7847
## 18  Scarlett  7679
## 19  Victoria  7267
## 20      Aria  7132
## 21     Grace  6991
## 22     Chloe  6912
## 23    Camila  6752
## 24  Penelope  6639
## 25     Riley  6343
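To try out tab-delimited reading without downloading anything, here is a hedged sketch that feeds a few lines of text to read.table and read.delim through their text argument. The three rows are copied from the output above; the object names are made up for this example.

```r
# A three-row stand-in for the real file, with columns separated by tabs ("\t").
tab_text <- "name\tn\nEmma\t19738\nOlivia\t18632\nAva\t15902"

# read.table() with the separator spelled out explicitly.
from_table <- read.table(text = tab_text, header = TRUE, sep = "\t")

# read.delim() assumes header = TRUE and sep = "\t" by default,
# so it should give the same result here.
from_delim <- read.delim(text = tab_text)

from_table
from_delim
```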

When you use ggplot2 to make visualizations of your own data, you’ll have to load it in and make sure it’s clean and tidy just like the sample datasets are. I won’t go over how to tidy your data, but a key part of creating good visualizations is good data.

1.3 Blank plots

Okay, finally, we’re ready to plot! The main function in ggplot2 is the ggplot function. In fact, you can call this function without any arguments, and it’s still valid R code.

ggplot()

When you run this line of code, in the bottom right panel of RStudio, the Plots tab is selected and this new visual appears. It doesn’t do much other than produce a gray rectangle though. In fact, it’s a coordinate system without any axes. This is the base layer that everything else gets added on top of. But it’s important to see what your blank canvas is, so to speak.

The first argument in the ggplot function is the data argument. To make a visualization of a particular dataset, just add data = plus the name of your dataset. We’ll use the stranger dataset that you should have downloaded earlier.

ggplot(data = stranger)

Great. All this does is create that same blank gray rectangle. ggplot2 is smart but it’s not that smart: you’ll have to tell it what to do with the data. What we do from here is to build a plot one layer at a time. The way to add layers is typically through one or more geom_* functions. The full list is long, but some of the functions that you might use include geom_point() for scatterplots, geom_boxplot() for boxplots, geom_bar() for bar charts, and geom_map() for maps. For the rest of the workshop we’ll be working with several geoms one at a time and discussing how to use them to make specific kinds of plots.
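Because a plot is just an R object, one way to see the layering idea is to store the blank canvas in a variable and add a geom to it afterwards. This is only a sketch: the toy data frame and its numbers are made up, and the aes() part is explained in the next section.

```r
library(ggplot2)

# Made-up numbers standing in for the real stranger data.
toy <- data.frame(votes = c(10500, 12000, 14000, 19000),
                  rating = c(8.5, 8.8, 9.1, 6.1))

# The base layer on its own: a blank canvas that knows about the data.
base_plot <- ggplot(data = toy)

# Adding a geom with + puts a layer of points on top of that canvas.
with_points <- base_plot +
    geom_point(mapping = aes(x = votes, y = rating))

with_points
```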

2 Two continuous variables

The most efficient way of showing two numeric variables at the same time is probably going to be a scatterplot. In this section we’ll also look at how to add more variables to your plot with the addition of aesthetics like color, shape, and size.

2.1 `geom_point`

Unlike with the ggplot function, you must provide some additional arguments to geom_point() (and all the other geoms for that matter). You do this with the mapping argument, which wraps up the various aesthetics of the plot inside the aes() function. Let’s make a scatterplot of the Stranger Things episodes and see their average rating plotted against how many people gave it a rating. Keep in mind that a scatterplot typically requires two columns of your spreadsheet to be all numbers, so rating (on a scale from 1 to 10) and votes will work fine. We’ll put votes on the x-axis and rating on the y-axis, separated by a comma. Note that the names of these in R must match exactly how they look in the column headers (including capitalization).

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating))

Boom. We just created a visualization. Here we can quickly see most of the episodes did pretty well. Except for one. If you’re familiar with the show, can you guess which one that was? Not only did it get the lowest score, but it also got the most ratings: an extra 2,000 people went out of their way to give it a poor score!

With just two short lines of code, we created a decent scatterplot. In addition to the gray background, notice what else has been added:

There are now white gridlines—major and minor ones if you look closely—overlayed on the grid.
There are x- and y-axis labels (“votes” and “rating”) with the y-axis one going vertically.
There are x- and y-axis ticks with labels (10000, 12500, 15000, 17500 along the bottom and 6, 7, 8, 9 along the side). The major gridlines align with these ticks and the minor gridlines are equidistant between them. Notice that the axes are automatically set based on your data, with a little bit of expansion on all sides so that the most extreme points aren’t right on the edge.
Obviously, the black circles represent the data. Notice that these are in front of the gridlines, meaning that the data layer was added after the gridlines were.

Every aspect of this plot, including all the defaults I just mentioned, can be modified. This and subsequent workshops will cover how to do some of these modifications.

An important thing to be aware of is that the aspect ratio (how wide and tall your plot is) is determined by the plotting window on your screen. If RStudio is not full screen or if your plotting area is small, your plot may look a little squished. You can get a bigger and better view by clicking on the “Zoom” button, which will open a separate window for you. You can resize this window however you want, and the plot will dynamically update. This is important when it comes time to export your images because what you see in the plotting area is not necessarily what you get in the saved file. To help with this, you can set the aspect ratio so that it’s fixed. But we’ll get into that when it comes time to export your plots.
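As a quick preview of why this matters, here is a hedged sketch that saves a plot with an explicit width and height using ggsave(), so the file does not depend on the size of the plotting window. The toy data, the temporary file, and the 6 by 4 inch size are all arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up numbers standing in for the real stranger data.
toy <- data.frame(votes = c(10500, 12000, 14000, 19000),
                  rating = c(8.5, 8.8, 9.1, 6.1))

p <- ggplot(data = toy) +
    geom_point(mapping = aes(x = votes, y = rating))

# Save to a temporary PDF with an explicit size (inches by default), so the
# result does not depend on how big the plotting window happens to be.
out_file <- tempfile(fileext = ".pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```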

2.2 Adding aesthetics

We can modify this chart by adding more aesthetics. Sometimes it’s useful to have the colors vary depending on some additional variable. In this dataset, we have a column that refers to each season that the episode came from. To add that to the plot, we add it as a third aesthetic:

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, color = season))

What ggplot has automatically done is determine what all the categories are in your dataset and assign them each a color. By default, the categories are in alphabetical order and the colors are equidistant shades from red to purple. There’s also a new element to the graph, the legend, which is centered vertically to the right of the main plotting area, which is now a bit narrower to make room for the legend.

In addition to color, there are other aesthetics that you could add, either as additional elements to your plot or combined with what you have. For example, we could vary the size of the points depending on how long the episode was.

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, color = season, size = minutes))

If we wanted to really emphasize the differences in the seasons, we could add shape as well as color. With only three seasons, this isn’t too bad and actually enhances the plot a little bit.

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, color = season, size = minutes, shape = season))

Shapes are only good for categories where there are just a few options, like 6 or fewer. If we were to do more (like a show with as many seasons as The Office or Cheers), ggplot2 would give you a warning saying that so many shapes are hard to tell apart.
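You can see that warning for yourself with a made-up show that has seven seasons. This is a sketch under assumptions: the data is invented, and the exact wording of the warning depends on your version of ggplot2, so the code just captures whatever message comes back rather than expecting a particular one.

```r
library(ggplot2)

# Seven made-up seasons, one point each: one more category than the
# default shape palette is designed for.
many_seasons <- data.frame(votes = 1:7,
                           rating = c(8.1, 8.4, 8.0, 8.7, 8.3, 8.9, 8.2),
                           season = factor(1:7))

p <- ggplot(data = many_seasons) +
    geom_point(mapping = aes(x = votes, y = rating, shape = season))

# Build the plot without drawing it and capture any warning text.
shape_warning <- tryCatch({
    ggplot_build(p)
    NULL
}, warning = function(w) conditionMessage(w))

shape_warning
```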

Right now we have four different pieces of information plotted on a single plot: the number of votes is on the left-to-right dimension, the rating is top-to-bottom, the season number is color and shape, and the length of the episode is size. That’s a lot of things for a human to process. It’s to the point where it’s hard to draw any meaningful conclusions about some of these variables from the plot.

This gets into the idea of why we make visualizations in the first place. We want to display a large amount of data in a way that makes it easy to digest and see patterns. By throwing all these variables into one plot, we don’t accomplish this purpose. It would be better to make separate plots that are easy to understand than one mega plot with everything.

In case you’re curious, you can see all the aesthetics that can be modified in a scatterplot by looking at the help page for geom_point (you can do this by typing ?geom_point). In the “Aesthetics” section of that help page, you’ll see that x and y are required, but you can change other things like alpha (transparency), color, fill, group, shape, size, and stroke. I’ll let you explore these on your own.

Now you try!

The challenge

Modify/remove the color, size, and/or shape of the above plot to get different cleaner plots. To learn more about variable types and how they work as additional aesthetics in ggplot2, try to answer the following questions.

What kinds of colors do you see if you set a categorical variable to be the color? What about a continuous variable?
What kind of variable is good for the size aesthetic?
The shape aesthetic only takes one kind of variable type. Which is it?

The solution

In these two plots, we see that if you set a categorical variable (like season) as the color, it will produce a rainbow of colors that are as different from each other as possible. But if you set a continuous variable (like minutes) as the color, it will produce a gradient that goes from dark blue to light blue by default.

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, color = season))

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, color = minutes))

These plots show that size works well for a continuous variable, but not so much for a categorical variable, and in fact ggplot2 gives you a warning message saying so. How polite!

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, size = season))

## Warning: Using size for a discrete variable is not advised.

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, size = episode))

Finally, shape only works for a categorical variable, and even then it’s best with 6 or fewer categories. It won’t even let you set a continuous variable to be the shape.

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, shape = season))

# Throws an error if executed.
ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = rating, shape = episode))

2.3 `geom_text` and `geom_label`

As an added bit of information, sometimes it’s nice to see the name of the episode instead of just a dot. We can accomplish this either by using geom_text or geom_label. They are essentially the same thing, but the latter is a little easier to read.

To use either of these, we need to add one more aesthetic: label. The column you’ll select is the one that’ll be displayed on your graph instead of the dot.

ggplot(data = stranger) +
    geom_text(mapping = aes(x = votes, y = rating, color = season, label = title))

ggplot(data = stranger) +
    geom_label(mapping = aes(x = votes, y = rating, color = season, label = title))

With only about 25 points, we can reasonably see each title pretty well. If you’re familiar with the story, it looks like the ones in the top right, the highest rated ones, are the finales. We can confirm this by putting the episode number as the label (showing that you can supply a continuous variable as the label too).

ggplot(data = stranger) +
    geom_label(mapping = aes(x = votes, y = rating, color = season, label = episode))

Yep, the season finales were all rated the highest. Other applications of geom_text or geom_label in linguistics might be when you need to make scatterplots where each individual word is displayed rather than dots. In your dataset you might use names of states, cars, people, events, etc. which can make your graph a lot more useful.
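With more points than we have here, labels start to pile up on top of each other. One option is geom_text’s check_overlap argument, which skips any label that would be drawn over an earlier one. The sketch below uses made-up data (the titles and numbers are placeholders), with two points deliberately placed almost on top of each other.

```r
library(ggplot2)

# Made-up episodes; the last two sit almost on top of each other.
toy <- data.frame(votes = c(10500, 12000, 14000, 14010),
                  rating = c(8.5, 8.8, 9.1, 9.1),
                  title = c("Episode A", "Episode B", "Episode C", "Episode D"))

# check_overlap = TRUE drops labels that would overlap a label drawn earlier,
# so expect one of the last two titles to be missing from the plot.
labeled <- ggplot(data = toy) +
    geom_text(mapping = aes(x = votes, y = rating, label = title),
              check_overlap = TRUE)

labeled
```

Keep in mind that dropped labels are simply not shown, so this is better for getting an overview than for plots where every label matters.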

Your Turn!

The challenge

Continue to modify the scatterplot using this data. Try putting different variables in for the various aesthetics (including the axes). Imagine you have something you want to convey to an audience about Stranger Things episodes with this data; make the best plot you can that most clearly shows that idea.
What kinds of things can be shown using a scatterplot of your own data? Be specific: what aesthetics would you use and what information would you associate with each?

The solution

This is an open ended question, so there’s no real right or wrong answer, but here are some plots I came up with.

ggplot(data = stranger) +
    geom_text(mapping = aes(x = minutes, y = rating, color = season, label = title))

ggplot(data = stranger) +
    geom_point(mapping = aes(x = episode, y = votes, color = season))

ggplot(data = stranger) +
    geom_point(mapping = aes(x = votes, y = minutes, shape = season, color = season))

3 One variable

With scatterplots and the addition of colors, shapes, and sizes, it’s easy to display a lot of variables all at once. Sometimes we just need to simplify things and just show one variable. One way to do this is with bars. Here we’ll learn about bar charts and their cousin, the histogram. We’ll also look at setting global properties of your plot, rather than having them vary with a variable.

3.1 `geom_bar`

Barplots are usually used when displaying how many of each category there are in a categorical variable. Since there are either 8 or 9 episodes for each season of Stranger Things, it doesn’t make for particularly interesting plots. So we’ll switch over to the McDonald’s menu items dataset for this section.

So here, we might want to use a bar plot to show how many menu items of each category there are. To do this, we can use the geom_bar function. The only aesthetic we need is the x argument, which would be the column that contains the categories that you want each bar to belong to. In this dataset, the name of that column is, conveniently, Category.

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category))

Notice what this plot does. Starting with the gray background layer, it superimposes several things: labels across the bottom for each menu category (they’re overlapping here, but in R you can widen the plot window yourself) with an axis label (Category), ticks across the left side with an axis label (count), another set of major and minor white grid lines, and the bars themselves. Here, we can see that the majority of menu items are actually coffee and tea products. This is likely because each individual size is treated as a separate item in this dataset because each size has its own nutritional amounts.

Okay, so now let’s try to fill in each bar with a different color per category. You’d think that the way we do that is to add color = Category, right?

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category, color = Category))

Oops! What did this do (other than create a kinda cool looking plot)? For geom_bar, it turns out that the color aesthetic changes the outline of the bar. If we want to fill it in, we have to use fill instead of color:

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category, fill = Category))

But what if we want all of the bars to be their own color but the outline to be just one color? If we want to apply some property to all items within a geom, we add that property outside of the aes function as a separate argument. For example, we can make the outlines all the same color while still retaining the colored bars. (Here, I use the color slategrey. You can see the full list of color names that R recognizes by running colors() in the console.)

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category, fill = Category), color = "slategrey")
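The difference between putting a property inside or outside of aes() trips up a lot of people, so here is a small sketch contrasting the two. The toy data is made up. Inside aes(), ggplot2 treats the text "slategrey" as a one-category variable rather than as a color name, so it picks its own color and adds a legend.

```r
library(ggplot2)

# Made-up categories standing in for the menu data.
toy <- data.frame(Category = c("Breakfast", "Breakfast", "Desserts", "Salads"))

# Setting: color is outside aes(), so every bar gets a slategrey outline.
set_outline <- ggplot(data = toy) +
    geom_bar(mapping = aes(x = Category), color = "slategrey")

# Mapping (usually a mistake with a constant): inside aes(), "slategrey" is
# treated as data, so ggplot2 chooses the outline color itself and adds a legend.
mapped_outline <- ggplot(data = toy) +
    geom_bar(mapping = aes(x = Category, color = "slategrey"))

set_outline
mapped_outline
```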

At this point, the sky is the limit as far as colors. You can set fill or color to be global properties or as aesthetics of the plot, or some combination of the two. For a minimalistic approach, you could set them both as global options and find a fill and an outline color that go well together.

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category), fill = "white", color = "royalblue")

There are a couple other things you can adjust for this plot. If you want to tweak the width of the bars themselves, you can do so with the width argument. The default is 0.9 (each bar takes up 90% of the space available to it), so use a larger number for wider bars and a smaller number for skinnier bars. Note that this is not an aesthetic, so it can’t vary by some variable. In other words, it’s an argument of geom_bar() rather than aes().

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category), fill = "springgreen3", color = "tan4", width = 3)

## Warning: position_stack requires non-overlapping x intervals

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category), fill = "springgreen3", color = "tan4", width = 1/3)

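You can even change the thickness of the outline using the size property. Here we set it as a global property, just like width; the default is 0.5. (In ggplot2 3.4.0 and later, this property is called linewidth for lines and outlines, and using size gives a deprecation warning.)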

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category), fill = "mediumvioletred", color = "midnightblue", size = 3)

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category), fill = "mediumvioletred", color = "midnightblue", size = 1/3)

So barplots are good because you can quickly show how many of each category there are. I prefer them to pie charts because judging the height of bars is easier than the angle of pie wedges.

3.2 What if my data is already summarized?

The previous section works great if your data is formatted in the correct way. Specifically, it works when you’ve got lots of rows and you want the bars to automatically summarize them for you. So for example there were 260 menu items in the menu dataset, but when I made the barplot, there were only 9 bars. I didn’t tell ggplot2 that there were 15 Beef & Pork menu items; it counted them for me.

Well, what if we’ve already done the summarizing? Let’s take a look at our girlnames dataset.

girlnames

##         name     n
## 1       Emma 19738
## 2     Olivia 18632
## 3        Ava 15902
## 4   Isabella 15100
## 5     Sophia 14831
## 6        Mia 13437
## 7  Charlotte 12893
## 8     Amelia 11800
## 9     Evelyn 10675
## 10   Abigail 10551
## 11    Harper 10451
## 12     Emily  9746
## 13 Elizabeth  8915
## 14     Avery  8186
## 15     Sofia  8134
## 16      Ella  8014
## 17   Madison  7847
## 18  Scarlett  7679
## 19  Victoria  7267
## 20      Aria  7132
## 21     Grace  6991
## 22     Chloe  6912
## 23    Camila  6752
## 24  Penelope  6639
## 25     Riley  6343

Here we have 25 rows, and we want to make a bar chart showing that Emma was the most common name and that names like Camila, Penelope, and Riley were less common. We can make a bar plot using the existing syntax on this dataset with no errors, but the result is not what we expect:

ggplot(girlnames, aes(x = name)) + 
    geom_bar()

What is going on here? As it turns out, what ggplot is showing is that there is exactly one row in the dataset for each of the 25 names. We kinda knew that already. So how can we incorporate the n column?

The key here is to use the stat = "identity" argument in geom_bar. What this does is it tells ggplot2 to use information in one of the other columns as the heights of the bars rather than trying to count them itself. To do this, we’ll need to tell ggplot which column we want to use, so we’ll add y = n within the aes() function. With those changes, our barplot should look better.

ggplot(girlnames, aes(x = name, y = n)) + 
    geom_bar(stat = "identity")

Okay, so this is better, but somewhat unexpected still. Ideally, I think we’d want the names in order of frequency, and right now, if you can see past the overlapping text, they’re alphabetical. Now, data cleaning isn’t a huge focus of this workshop, so I’ll be brief, but one way to force the bars to be in order of frequency is by modifying how R is treating that column under the hood. Specifically, we’re turning it into a factor, and telling it that the order of the names should be the order they’re currently in, rather than the default alphabetical order:

girlnames$name <- factor(girlnames$name, levels = girlnames$name)

We can now use identical code on this now-modified girlnames dataset and it should look better.

ggplot(girlnames, aes(x = name, y = n)) + 
    geom_bar(stat = "identity")
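To see what that factor() call changes under the hood, here is a minimal base R sketch with three rows copied from the table above (toy_names is a made-up object for this example). ggplot2 orders the bars by the factor’s levels, which is why resetting the levels reorders the plot.

```r
# Three rows from the table above, already sorted from most to least common.
toy_names <- data.frame(name = c("Emma", "Olivia", "Ava"),
                        n = c(19738, 18632, 15902))

# By default, factor levels are sorted alphabetically.
levels(factor(toy_names$name))   # "Ava" "Emma" "Olivia"

# Passing levels = explicitly keeps the order the rows are already in.
toy_names$name <- factor(toy_names$name, levels = toy_names$name)
levels(toy_names$name)           # "Emma" "Olivia" "Ava"
```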

The last thing we’ll want to fix is to turn those names sideways so that we can actually read them. We can do this with the help of theme, a powerful function that controls nearly every aspect of the visual appearance of the plot. There are like a hundred arguments to theme, but we’ll just focus on one, axis.text.x, and we’ll set the angle to 90 degrees and right-align the names:

ggplot(girlnames, aes(x = name, y = n)) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

So with geom_bar, we can allow the function to summarize the data for us, or we can tell it to use the already-summarized data. Either way, we can make decent-looking bar charts with relatively few lines of code.
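As an aside, ggplot2 also has geom_col(), which is a shortcut for the already-summarized case: it uses the y values as the bar heights without needing stat = "identity". The sketch below uses three rows copied from the girlnames table; the two plots should look the same.

```r
library(ggplot2)

# Three rows from the girlnames table, used as a small stand-in.
toy_names <- data.frame(name = c("Emma", "Olivia", "Ava"),
                        n = c(19738, 18632, 15902))

# The approach used in this section.
with_identity <- ggplot(toy_names, aes(x = name, y = n)) +
    geom_bar(stat = "identity")

# The shortcut: geom_col() uses y as the bar height directly.
with_col <- ggplot(toy_names, aes(x = name, y = n)) +
    geom_col()

with_identity
with_col
```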

3.3 `geom_histogram`

Related to the bar plot is the histogram. On the surface, they share a lot of similarities, but their underlying data is different. For the bar plot, we supplied ggplot with a categorical variable with discrete, unordered categories. A histogram makes a similar plot with a numeric variable, so you can see the distribution of the data. If we look at the distribution of sugar using a histogram, we can see that the majority of menu items are on the relatively lower end in sugar.

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Sugars))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You’ll notice there was a message saying something about a binwidth. This is the way to adjust the width of the bars, but it’s a little bit different with geom_histogram than with geom_bar. Since we’re dealing with numbers, a wider bar would take up more space on the x-axis, meaning it eats up more numbers along that axis. For example, the default width here is about 5 units per bar. So the furthest bar to the left shows how many menu items have roughly 0–5 grams of sugar. If we made that bar wider, it might cover 0–7 on the graph, therefore changing the amount of data, which changes the height of the bar. For histograms, the width of these bars is called the binwidth since we’re dividing the data up into different “bins.” If you change it to smaller or wider bins, notice how the shape of the graph changes.

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Sugars), binwidth = 10)

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Sugars), binwidth = 1)

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Sugars), binwidth = 25)

Finding the right binwidth is an important part of making histograms. If you do it wrong, you’ll end up misrepresenting your data. Too wide and it oversimplifies. Too narrow and you miss the overall curve. It’s always worth it to change the binwidth a little bit each time you make a histogram to find which one works best for you. There are algorithms that determine what the best width should be, but it’s usually fine to just eyeball it.
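As one example of such an algorithm, the Freedman-Diaconis rule sets the binwidth to 2 × IQR / n^(1/3), where IQR is the interquartile range and n is the number of observations. The sketch below applies it to made-up sugar values (not the real menu data). It’s meant as a starting point for eyeballing, not as the one correct answer.

```r
library(ggplot2)

# Made-up sugar values standing in for menu$Sugars.
toy <- data.frame(Sugars = c(0, 0, 1, 3, 5, 6, 8, 10, 12, 15, 18, 22,
                             27, 32, 40, 48, 55, 63, 77, 90, 110, 128))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
fd_binwidth <- 2 * IQR(toy$Sugars) / nrow(toy)^(1/3)
fd_binwidth

ggplot(data = toy) +
    geom_histogram(mapping = aes(x = Sugars), binwidth = fd_binwidth)
```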

This chart is a little hard to read since the columns are all right next to each other. We can change this using the fill and color attributes just as before:

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Sugars), fill = "gold", color = "orangered", binwidth = 4)

Your Turn!

The challenge

Think of your own data: when might you use a chart like these? Would you use a bar chart or a histogram? How do you know?
Create two charts: one that shows how many items there are per category and one that shows the distribution of fat overall. Alter the aesthetics to make them look professional.
What happens when you use geom_bar with a continuous variable (like Fat)? Try to replicate that same plot using geom_histogram.

The solution

For the first question, bar charts are generally better for categorical data (and there could be many different categories) while histograms are better for continuous data.

Here are the two charts that show (1) how many items there are per category and (2) the distribution of fat overall.

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Category), fill = "white", color = "royalblue")

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Fat), fill = "white", color = "royalblue", binwidth = 5)

Finally, when you use geom_bar on a continuous variable, it does one bar per number. You can replicate that pretty closely with geom_histogram by setting the binwidth to 0.5 (since we have Fat content in units of half grams). There will be some small differences like axis lines and very slight differences in the widths of the outlines but we can ignore those.

ggplot(data = menu) +
    geom_bar(mapping = aes(x = Fat), fill = "white", color = "orange")

ggplot(data = menu) +
    geom_histogram(mapping = aes(x = Fat), fill = "white", color = "orange", binwidth = 0.5)

4 One continuous variable and one categorical variable

So far we’ve looked at what kinds of plots you can do with two continuous variables, and with just one variable (categorical or continuous). The natural extension to this is what happens when you have one categorical variable and one continuous variable. We’ll start with the more typical boxplots but then move on to violin plots. We’ll also cover how to add even more layers to your graph.

4.1 `geom_boxplot`

A box(-and-whisker) plot is something you might learn about in statistics because it does a decent job of summarizing your data. It shows the median, the spread (the box spans the middle half of the data), and outliers. We can make a basic boxplot using the geom_boxplot function. Here, we need a categorical x variable and a numeric y variable.

ggplot(data = menu) +
    geom_boxplot(mapping = aes(x = Category, y = Sugars))

This plot shows that smoothies and shakes generally have the most sugar. It also shows that beef and pork items all have roughly the same amount of sugar (the box is very squashed, meaning the distribution is narrow), while coffee and tea vary quite a bit more (that box is much taller).
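If you’re curious where the parts of a box come from, here is a small sketch that pulls the numbers out of a boxplot with layer_data() and compares them to quantile(). The toy data frame is made up for illustration (one hypothetical category with one unusually large value) so that the code runs on its own.

```r
library(ggplot2)

# Made-up sugar values for one hypothetical category.
toy <- data.frame(
    Category = "Toy",
    Sugars   = c(1, 2, 3, 4, 5, 6, 7, 8, 30)
)

p <- ggplot(data = toy) +
    geom_boxplot(mapping = aes(x = Category, y = Sugars))

# The numbers geom_boxplot draws: whisker ends, box edges, and the middle line.
box <- layer_data(p)
box[, c("ymin", "lower", "middle", "upper", "ymax")]

# The same box edges and middle line, computed directly.
quantile(toy$Sugars, c(0.25, 0.5, 0.75))

# Points more than 1.5 box-heights beyond the box are drawn as outliers.
box$outliers
```

The middle line is the median, not the mean, so a single extreme value like the 30 here doesn’t drag it upward.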

Just like the plots we’ve seen before, we can modify some of the properties of this plot using the same aesthetics as before (not that you’d want your plot to look like this, but it serves as a good illustration).

ggplot(data = menu) +
    geom_boxplot(mapping = aes(x = Category, y = Sugars, color = Category), width = 0.75, size = 1.5, fill = "black")

One problem with boxplots is they abstract away from the actual data. It’s actually possible to get identical boxplots with wildly different distributions of data. For this reason, sometimes it’s nice to plot the points themselves in addition to the boxplot. How can we plot points if we already have geom_boxplot? The answer is simple: just do both!

ggplot(data = menu) +
    geom_boxplot(mapping = aes(x = Category, y = Sugars)) +
    geom_point(mapping = aes(x = Category, y = Sugars))

The code here is relatively straightforward. We’ve seen before how we can add a layer to the base gray rectangle by adding a plus sign and then some geom. Well we can add as many layers as we want in the same way: just add a + and then some other ggplot2 function. Here, we’re adding a scatterplot on top of the boxplot. The order here is important: the layers are added in the order that they appear in your code: first the base gray, then the boxplots, then the points. So if we switch the latter two around, you’ll see that the boxplot covers the points (color added for clarity):

# Same graph, but plotting the points first.
ggplot(data = menu) +
    geom_point(mapping = aes(x = Category, y = Sugars), color = "blue") +
    geom_boxplot(mapping = aes(x = Category, y = Sugars), color = "red")

Something else you may have noticed is that there’s a bit of repetitive code there. In both geoms we need to specify mapping = aes(x = Category, y = Sugars). This is functional code, but it can be cumbersome. For example, if you wanted to switch from Sugars to Fat, you’d have to make the change twice. It would be easier if we could somehow specify it just once.

In fact, we can! We can actually move that whole mapping part up to the ggplot function. Since this is the base layer, it passes that information on to all the subsequent layers as if it were explicitly typed there. In fact, we’ve already been doing that with data = menu. The data argument could have been specified in each layer, but we just made it available to all layers by keeping it in the ggplot function. Thus, a plot equivalent to the one with points overlaid on the boxplot, but with cleaner code, might look like this.

ggplot(data = menu, mapping = aes(x = Category, y = Sugars)) +
    geom_boxplot() +
    geom_point()

This saves a bit of typing, and it also makes your code a little easier to read. We can still have aesthetics and other properties in the geom functions if we want to override what’s in ggplot, or specify something that should be in that layer only. In the following chart I specify that only the boxplot should be blueviolet and that the dots should reflect the Fat data instead of Sugars. (In fact, you can even specify a whole different dataset in individual geoms, but we won’t get to that here.)

ggplot(data = menu, mapping = aes(x = Category, y = Sugars)) +
    geom_boxplot(color = "blueviolet") +
    geom_point(mapping = aes(y = Fat))

This is not very useful and is an incredibly misleading plot, but it serves as a good illustration of what’s possible. The layer-by-layer approach to graph building gives you incredible flexibility. You’d have a really hard time doing the same thing in Excel.

4.2 `geom_violin`

Boxplots have been around for a while, and their various components can easily be calculated and drawn by hand. However, it’s very possible for completely different sets of data to produce identical boxplots. This is beautifully illustrated by Justin Matejka and George Fitzmaurice in a recent paper (which you can and should read at https://www.autodeskresearch.com/publications/samestats).
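Here is a small-scale sketch of that problem. The two samples below are made up for illustration: they share the same minimum, quartiles, median, and maximum, so their boxplots come out the same even though one is spread toward the edges and the other is piled up in the middle.

```r
library(ggplot2)

# Two made-up samples with the same minimum, quartiles, median, and maximum.
spread_out <- c(0, 0, 3, 3, 5, 7, 7, 10, 10)
peaked     <- c(0, 3, 3, 5, 5, 5, 7, 7, 10)

quantile(spread_out)
quantile(peaked)

toy <- data.frame(
    Sample = rep(c("spread_out", "peaked"), each = 9),
    Value  = c(spread_out, peaked)
)

# The two boxes come out the same even though the samples differ.
p <- ggplot(data = toy, mapping = aes(x = Sample, y = Value)) +
    geom_boxplot()
p
```

With real data the differences can be far more dramatic than in this toy case, which is the point of the paper.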

Violin plots are an evolution of boxplots and are more faithful to the underlying data. We can create one in ggplot2 by simply swapping out geom_boxplot with geom_violin:

ggplot(data = menu, mapping = aes(x = Category, y = Oz)) +
    geom_violin()

The shape of each plot roughly shows the distribution of the data. Wider portions mean more data around there. The plot gets its name because sometimes the shapes look like violins. You can compare violin plots to the raw distribution by overlaying the points again.

ggplot(data = menu, mapping = aes(x = Category, y = Oz)) +
    geom_violin() +
    geom_point()

One thing that is useful about functions in many programming languages (including R and thus ggplot2) is that the name of the argument can be left off. So far in all of our code, we’ve typed data = and mapping = every time. The arguments of a function in R have a default order, so if you know that order, you can leave off the names and R will match the arguments by position. In the ggplot() function (and in most tidyverse functions, actually), the first argument is data. So if you just type the name of your data frame first, ggplot2 will assume that it’s the data. The second argument in ggplot() is mapping, so you can leave that name off as long as the aes() function comes second.

For example, the following two are identical:

ggplot(data = menu, mapping = aes(x = Category, y = Oz)) +
    geom_violin(color = "blueviolet") +
    geom_point(mapping = aes(y = Fat))

ggplot(menu, aes(Category, Oz)) +
    geom_violin(color = "blueviolet") +
    geom_point(aes(y = Fat))

Note that for geom_point, I was able to leave off mapping = because in geom_point, mapping is the first argument. Even within the aes() function, the first and second arguments are x and y, meaning we can leave those off too. However, in geom_violin, I had to specifically say that "blueviolet" is the color argument, because color is not the first argument of geom_violin. When in doubt, always specify the argument, but it does save some typing when you can leave them off.
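If you ever forget the order, you can ask R directly. This is a small sketch using base R’s formals() to list a function’s arguments in order, plus a demonstration of what happens when an unnamed value lands in the wrong slot. I’m wrapping the bad call in tryCatch() only so the script keeps running; the exact wording of the error message depends on your version of ggplot2.

```r
library(ggplot2)

# The first two arguments of ggplot() are data, then mapping...
names(formals(ggplot))[1:2]

# ...but in a geom the order is flipped: mapping, then data.
names(formals(geom_point))[1:2]

# So an unnamed "blueviolet" lands in the mapping slot and is rejected.
result <- tryCatch(
    geom_violin("blueviolet"),
    error = function(e) conditionMessage(e)
)
result
```

You can get the same information from the help page, for example with ?geom_violin.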

Your turn!

The challenge

Think of your own data and what things you could show using boxplots or violin plots. Keep in mind the data types required to make these plots. What kinds of things can you not show using these plots?
Try swapping out geom_point for geom_jitter. What does that do to your plot and how is it helpful?
Take a few minutes and play around with boxplots and violin plots for Sugars, Fat, and Oz. Choose one and make a good plot that is faithful to the data. Play around with the aesthetics to make it look good. Be sure to make your code concise yet readable. Keep in mind your audience: if you’re going to be the only one seeing this code, do what makes sense for you, but if you’re going to share the code (and you never know if you will), do what is most readable generally.

The solution

Boxplots and violin plots are good for one categorical variable and one continuous variable. In theory, you could have as many categories as you want within that categorical variable, but too many gets hard to read. It’s best if the number of things you’re comparing is relatively small (maybe 2 to 20?).

If you use geom_jitter instead of geom_point, it will randomly nudge each point a small amount, spreading the points out along the x-axis within the width of the boxplot. (By default it adds a little bit of vertical noise too.)

ggplot(menu, aes(Category, Oz)) +
    geom_boxplot() +
    geom_jitter()

This is actually quite useful because it shows all the data. When using geom_point, if you have many points with the exact same weight (or whatever value is on the y-axis), they’ll just show up as one dot because they’re right on top of each other. With geom_jitter, they get spread out so you can see how many points there really are.
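If the random nudging makes you nervous about misplacing your data, you can control it. Here is a minimal sketch, using a made-up toy data frame in place of menu, that keeps the sideways spread but turns off the vertical noise with height = 0, so every point stays at its true y value. The specific width of 0.2 is just a value I picked for illustration.

```r
library(ggplot2)

# Made-up drink sizes with many repeated values, standing in for menu.
toy <- data.frame(
    Category = rep(c("Coffee", "Shakes"), each = 6),
    Oz       = c(12, 12, 12, 16, 16, 20, 12, 16, 16, 16, 22, 22)
)

set.seed(1)  # jitter is random; setting a seed first makes it repeatable
p <- ggplot(toy, aes(Category, Oz)) +
    geom_boxplot() +
    geom_jitter(width = 0.2, height = 0)
p

# Layer 2 is the jittered points: x is nudged, y is left alone.
pts <- layer_data(p, 2)
head(pts[, c("x", "y")])
```

A narrower width keeps the points visually tied to their category, which helps when you have many categories side by side.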

Here are just some plots I made with the data that illustrate some of the ideas we’ve covered in this section.

ggplot(menu, aes(Category, Sugars)) +
    geom_boxplot(fill = "black", color = "grey75") +
    geom_jitter(aes(color = Category))

ggplot(menu, aes(Category, Fat)) +
    geom_boxplot()

ggplot(menu, aes(Category, Oz)) +
    geom_violin(color = "forestgreen") +
    geom_jitter(color = "grey15")

5 Conclusions

This chapter can only cover so much, but I’ve tried to show lots of different things that can be done with ggplot2 without a lot of work. Before we finish, it’s nice to see where you can go to learn more.

5.1 Where to go for help

I’ll be the first to admit that working with ggplot2 is rough and there is a learning curve. However, after a while, especially once you’ve figured out what kinds of plots you tend to do with your own data, things come naturally. In the meantime, here are some places that I have found to be handy bookmarks when struggling with ggplot2, many of which are produced by Hadley Wickham himself.

Springer has published a book in their “Use R!” series that’s simply called ggplot2 http://www.springer.com/us/book/9783319242750. It’s written by Wickham and is basically the full documentation for the package. It’s comprehensive but still aimed at learners. Your university may have access to this for free through their library.
There’s a cheatsheet available by going to RStudio -> Help -> Cheatsheets -> Data Visualization with ggplot2. It’s not exactly easy to learn from it, but it’s a good reminder of things you’ve learned already.
There’s always help within R itself. You can type ?geom_point for example and get help on that function.
ggplot2’s official website, http://ggplot2.tidyverse.org, is a good launching pad. From there you can find additional resources, tutorials, and presentations that Wickham has done. Two such resources can be found at http://r4stats.com/examples/graphics-ggplot2/ and http://r4ds.had.co.nz/data-visualisation.html.
A couple of other presentations that Wickham has done on ggplot2 are available here and here.
One great site that appears a lot when I google for help is the book R Graphics Cookbook, and its accompanying website, cookbook-r.com. In particular, they have a section on graphs (http://www.cookbook-r.com/Graphs/), which mostly covers ggplot2. There, you can see how to manipulate things like colors, text size, legends, axes, and all sorts of other stuff. It’s probably the resource I used the most when learning ggplot2. They have lots of examples of the many things that are possible to manipulate, showing that you really do have control over every aspect of the plot.
If you have access to Lynda.com, there are some great courses on R and ggplot2 there.

5.2 Looking ahead

In the next workshop we’ll cover ways to customize the plots you’ve already created. We’ll look at how to change the default colors, put things in a different order, modify axis labels, work with legends and titles, use faceting and themes, and save plots. After that we’ll dive into the nitty gritty of your plots to show you how you can customize whatever you want about them.

An intro to data visualization in R using ggplot2

Joey Stanley

8/21/2019

1 The basics

1.1 Downloading and Installation

1.2 Data for this workshop

1.2.1 Stranger Things

1.2.2 McDonald’s menu items

1.2.3 Top 25 Girl Names in 2017

1.3 Blank plots

2 Two continuous variables

2.1 `geom_point`

2.2 Adding aesthetics

Now you try!

The challenge

The solution

2.3 `geom_text` and `geom_label`

Your Turn!

The challenge

The solution

3 One variable

3.1 `geom_bar`

3.2 What if my data is already summarized?

3.3 `geom_histogram`

Your Turn!

The challenge

The solution

4 One continuous variable and one categorical variable

4.1 `geom_boxplot`

4.2 geom_violin

Your turn!

The challenge

The solution

5 Conclusions

5.1 Where to go for help

5.2 Looking ahead
