This workshop introduces the R package ggplot2. After some introductory discussions of visualizations and some basic data types, we dive right into how to make different kinds of basic plots. This workshop does not teach every aspect of ggplot2, but instead exposes you to some basic code to make some plots, with the hopes that you leave being able to apply this code to your own data.
This is the first of three ggplot2 workshops. Next week, we will explore more of the ggplot2 syntax and see how to modify aspects of your plot like the colors and how to reorder things. During the last week, we’ll dive into more advanced topics, and look at how to change the overall “theme” of your plot, including how to add custom themes to match your PowerPoint slides.
To participate in this workshop, it is expected that you have some experience with R. I don’t expect you to be a pro, but I’m assuming you have been able to get your data into R, you’ve run some functions, and that you’re familiar with the basics.
What is the purpose of data visualization? We see charts, figures, graphs, and plots all over the place, but have you ever stopped to think about what it is that those are doing? It seems like we visualize data because we need to consume a lot of information all at once.
The data that we need to visualize most often takes the form of a table or spreadsheet of some sort. But unless it’s very small, it’s hard to get a good idea of trends and patterns. Some statistical methods are designed to summarize your data in various ways, including things like the mean, median, and standard deviation. But sometimes it’s just nice to be able to “see” hundreds or thousands of raw data points all at once, without any summary statistics.
Data visualization also takes two forms, split up by the intended audience: you and not you.
For yourself: Sometimes, all you need to do is create a quick-and-dirty graph so that you can get an idea of what’s going on. These types of visualizations should be easy, quick, and informative. Little details like the aesthetics of the overall image are less important. Some kinds of data visualizations are meant specifically for the researcher and are not intended to be included in any sort of publication. For example, plotting the residuals of a regression model in a Q-Q plot lets the researcher check whether the residuals are approximately normally distributed (checking for homoscedasticity usually calls for a residuals-versus-fitted plot instead). Looking at a scree plot helps determine how many principal components to keep in a PCA. These plots are important because they allow the researcher to gather information that can be used to determine future analysis, but they often don’t get seen by others.
For others: The other type of plot is meant for public consumption. These are the ones you actually see on a website, in a presentation, or on a printed page. These are designed to convey specific information about some data in order to support an analysis. They are also meant to be visually very clean and crisp.
The key to a good visualization is that it lets the data speak for itself. The addition of extra fluff (shadows, 3D, extravagant colors) eclipses what the graph is actually showing. A good visualization is minimal. It is also faithful to the data, and doesn’t misrepresent it by modifying axes or colors the wrong way.
Data visualization is as much an art as it is a science. Yes, it takes computational power to turn a spreadsheet into some beautiful graphic, but it takes an actual human to design the visual and make it appropriate for what you’re showing. It should be thought of as an additional tool to help your audience understand the idea you’re trying to convey.
If you’re anything like me, you’ve often felt like the ability to make professional charts and graphs has been impossible without some serious photoshopping skills. I’ve seen some really compelling visualizations of data in conferences and papers that really make it easy to summarize a lot of data in a single graphic. It sometimes isn’t even anything fancy: a simple bar chart or scatterplot can add a nice visual touch to an otherwise text-heavy project.
The problem with making visualizations is that a lot of the existing software has some limitations. In my opinion, good data visualization software should have these properties:
Customization: I know of a website where I can upload my data and it will produce a single stunning graphic. It’s beautiful, but that’s all it does. If I want to modify the colors, rotate it, add labels, change the font, or make any other changes, I can’t. Ideally, I should be able to customize whatever I want in my plot. And I mean everything. Font, line width, slight shades in colors, layout. Most software won’t give you that.
Professional: Some software has the flexibility of making custom plots, but they all end up looking a bit cheesy. Yes, the information is conveyed, but it ends up looking like a middle-school science fair. You have no control over graphics like 3D shapes, shadows, and other details. Ideally, data visualization should produce stunning graphics that you would not be ashamed of showing at a conference presentation or using in a journal publication. Here’s a relevant XKCD cartoon (xkcd.com/1945/): let’s not produce charts and graphs in the powerpoint/Paint era!
Avoids carpal tunnel: There is a lot of software out there that produces great graphics and you can customize it however you want, but it’s extremely tedious. There just seem to be lots of clicks and menus you have to navigate through to get small changes. Sometimes changes have to occur in a specific order, and if you want to undo something, you have to do a lot of clicking again, or start all over. I don’t like clicks. I think using the keyboard is easier on my hands and wrists, so I prefer writing code to any sort of menu or click-based software.
Reproducibility: Related to the clicks is the idea that plots should be reproducible. Some software will let you add whatever you want to the plot, but you have to manually place things. This flexibility is desirable to some, but can be a pain for most people. For example, simple things like centering a title have to be eyeballed. Another problem with manual layout is that you’ll never quite get the same plot twice. This is especially problematic if you want to create several similar plots on different datasets or update a plot based on updated data. It’s hard or impossible to match them exactly because odds are they’ll differ in slight (but frustratingly noticeable) ways.
Excel is a temporary solution, but you should not be satisfied with those plots. The direct link to your data is nice, but all those plots look awful and are a pain to customize. Last year I gave a presentation on JMP and discussed the visualizations that are possible. Like Excel, these are not very professional looking and are very tedious to customize. Above all, Excel and other software only create a certain set of visualizations. If you want to create something brand new or an interesting combination of plots, it’ll be very hard to do so in Excel.
One piece of software that satisfies all of my demands is ggplot2. ggplot2 is an elegant and versatile R package that creates beautiful visualizations of data. Being an R package means it’s just a bunch of extra functions that have been written up and made available online for you to download. Its author is Hadley Wickham, who has a knack for writing good, useful R packages.
ggplot2 makes it easy to customize whatever you want in a plot. Yes, it comes with defaults, so if you just want quick and dirty visualizations, you can make those plots with no problem. But literally every aspect of the plots can be modified. This is what makes the plots look professional. Even the default settings aren’t bad, and I have seen them in professional settings. But with just a little effort, you can make really nice graphics. This is all made easier because ggplot2 is done entirely in R, meaning it’s all written as code. No carpal tunnel here. This code-based nature of it is also what makes the plots perfectly reproducible every time, so making similar plots with different data is a breeze.
ggplot2 is so good because it approaches the creation of visualizations a little differently than you might expect. It uses what’s called the “Grammar of Graphics”, based on a book of that title by Leland Wilkinson (available as a free eBook download through UGA’s library!). In fact, that’s what the “gg” in ggplot2 stands for. I don’t have the time or space to go into detail about what this is, but the basic idea is that plots are built layer by layer. Because all the components of a plot are separated out, they are easier to manipulate and control, if you want the flexibility. It also takes care of most details for you by default, which makes it easy to use.
Before we get too carried away, I want to emphasize something: not all visualizations are meant for all kinds of data. What do I mean by this? Just as certain statistical procedures require specific types of data, certain visualizations need certain types of data.
I’ve talked to people who wanted to make a scatterplot but when I took a look at their data, I saw that it was nothing but text, which doesn’t lend itself to being a scatterplot. Scatterplots require at least two columns in your table (variables, as I’ll refer to them from now on) to be number-like. I’ve tried to help other people make other kinds of plots because they’re flashy, sexy, and are used in other papers, but the important part is that you absolutely need the right kind of data.
The two main data types that I’ll be referring to in this workshop are categorical and continuous data. Categorical data is something that can be grouped into distinct categories. These categories have no meaningful order and are mutually exclusive. The number of categories can be small (glasses/contacts/nothing), relatively large (nationality, state of residence), or nearly infinite (favorite color, unique words). Some visualizations lend themselves well to categorical data, and some are better when there are fewer categories.
The other main kind of data is numeric or continuous data. These are numbers. These typically are things like measurements (height, weight, velocity, acoustic measurements, counting things, etc.) but can also be things like latitude and longitude. Sometimes it makes sense to have decimals (measurements, for example), and other times decimals don’t make sense (counting things). There are lots of finer distinctions between subtypes of continuous data, but for now we’ll stick with just the basic concept.
Think of your own data. What kinds of categorical variables do you have? What kinds of numeric data do you have?
With the theoretical ideas out of the way, we’re ready to start working in R.
ggplot2 does not come standard with R, so you’ll have to explicitly install it on your computer. Luckily, this is pretty straightforward and can be done just like any other R package.
install.packages("ggplot2") # If you haven't done so already.
Again, you only need to do this once, unless you want to update the package. What you do need to do every time you run R is to load the package using the library() function. Go ahead and do that now.
library(ggplot2)
Alternatively, if you also use packages like dplyr or tidyr, you can load them all at once by installing and loading the tidyverse package, which includes all three (and more).
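If you go the tidyverse route, the two steps look just like the ones for ggplot2. This is only a sketch: the `requireNamespace()` check is my own addition so that the install step is skipped when the package is already there, and the install itself needs an internet connection.

```r
# Install once, and only if the package is not already on this computer.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it every time you start R. This attaches ggplot2, dplyr, tidyr,
# and several other packages in one go.
library(tidyverse)
```

After this, you don't need a separate `library(ggplot2)` call in the same session.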
The data that we’ll be working with is a spreadsheet of McDonald’s menu items. This file contains some nutritional information such as calories, fat, and sugars, as well as the item name and category. It is available for free at Kaggle.com, where you can get complete nutritional information. I’ve got a subset of this data on my website, so you can just read in this file directly from there into R like this:
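menu <- read.csv("http://joeystanley.com/downloads/menu.csv", stringsAsFactors = TRUE) # Read text columns as factors so summary() can count each category (R 4.0+ no longer does this by default).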
Let’s inspect this dataset just so we have a better idea of what it looks like.
View(menu)
summary(menu)
##                Category                                         Item
##  Coffee & Tea      :95   1% Low Fat Milk Jug                    :  1
##  Breakfast         :42   Apple Slices                           :  1
##  Smoothies & Shakes:28   Bacon Buffalo Ranch McChicken          :  1
##  Beverages         :27   Bacon Cheddar McChicken                :  1
##  Chicken & Fish    :27   Bacon Clubhouse Burger                 :  1
##  Beef & Pork       :15   Bacon Clubhouse Crispy Chicken Sandwich:  1
##  (Other)           :26   (Other)                                :254
##        Oz            Calories           Fat              Sugars
##  Min.   : 1.000   Min.   :   0.0   Min.   :  0.000   Min.   :  0.00
##  1st Qu.: 6.775   1st Qu.: 210.0   1st Qu.:  2.375   1st Qu.:  5.75
##  Median :12.000   Median : 340.0   Median : 11.000   Median : 17.50
##  Mean   :12.803   Mean   : 368.3   Mean   : 14.165   Mean   : 29.42
##  3rd Qu.:16.000   3rd Qu.: 500.0   3rd Qu.: 22.250   3rd Qu.: 48.00
##  Max.   :32.000   Max.   :1880.0   Max.   :118.000   Max.   :128.00
##
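Since the kind of plot you can make depends on whether a column is categorical or continuous, it can help to ask R directly which columns are which. Here is a minimal sketch. So that it runs without the download, it uses a tiny stand-in table called `menu_demo`, which is hypothetical: it has the same column names as the real file, but the rows and values are invented for illustration.

```r
# Hypothetical stand-in for the menu data: same column names, invented values.
menu_demo <- data.frame(
  Category = factor(c("Breakfast", "Breakfast", "Beverages", "Beverages")),
  Item     = c("Item A", "Item B", "Item C", "Item D"),
  Oz       = c(4.8, 5.7, 12, 16),
  Calories = c(300, 430, 140, 0),
  Fat      = c(13, 23, 0, 0),
  Sugars   = c(3, 3, 39, 0)
)

str(menu_demo)    # one line per column: its type and its first few values
head(menu_demo)   # the first rows of the table

# TRUE marks numeric (continuous) columns; FALSE marks everything else.
is_numeric <- sapply(menu_demo, is.numeric)
names(menu_demo)[is_numeric]    # candidates for the x-axis, y-axis, or size
names(menu_demo)[!is_numeric]   # candidates for color or shape
```

On your own data, swap `menu_demo` for the name of your data frame.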
When you use ggplot2 to make visualizations of your own data, you’ll have to load it in and make sure it’s clean and tidy just like the sample datasets are. I won’t go over how to tidy your data—we’ll do that in March—but a key part of creating good visualizations is good data.
Okay, finally, we’re ready to plot! The main function in ggplot2 is the ggplot function. In fact, if you call this function without any arguments, it’s still valid R code.
ggplot()
It doesn’t do much other than produce a gray rectangle though. In fact, it’s a coordinate system without any axes. This is the base layer that everything else gets added on top of. But it’s important to see what your blank canvas is, so to speak.
The first argument in the ggplot function is the data argument. To make a visualization of a particular dataset, just add data= plus the name of your dataset. We’ll use the menu dataset that you should have downloaded earlier.
ggplot(data=menu)
Great. All this does is create that same blank gray rectangle. ggplot2 is smart but it’s not that smart: you’ll have to tell it what to do with the data. What we do from here is to build a plot one layer at a time. The way to add layers is typically through one or more geom_* functions. The full list is long, but some of the functions that you might use include geom_point() for scatterplots, geom_boxplot() for boxplots, geom_bar() for bar charts, and geom_map() for maps. For the rest of the workshop we’ll be working with several geoms one at a time and discussing how to use them.
The most efficient way of showing two numeric variables at the same time is probably a scatterplot. In this section we’ll also look at how to add more variables to your plot with the addition of aesthetics like color, shape, and size.
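To give a feel for how the geom decides the kind of plot, here is a rough sketch where the base call stays the same and only the geom_* layer changes. It uses a hypothetical six-row table, `menu_demo`, with invented values, and it borrows the mapping and aes() pieces that are explained in the scatterplot section.

```r
library(ggplot2)

# Hypothetical data with invented values, just to have something to plot.
menu_demo <- data.frame(
  Category = c("Breakfast", "Breakfast", "Breakfast",
               "Beverages", "Beverages", "Beverages"),
  Calories = c(300, 430, 520, 0, 140, 210)
)

# Same ggplot() base both times; only the geom_* layer differs.
p_box <- ggplot(data = menu_demo) +
  geom_boxplot(mapping = aes(x = Category, y = Calories))

p_bar <- ggplot(data = menu_demo) +
  geom_bar(mapping = aes(x = Category))  # geom_bar() counts the rows in each category

print(p_box)
print(p_bar)
```

The boxplot needs a categorical variable and a continuous one, while the bar chart here only needs the categorical one, which ties back to the point about matching plots to data types.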
geom_point
Unlike the ggplot function, you must provide some additional arguments to geom_point() (and all the other geoms for that matter). You do this with the mapping argument, which wraps up the various aesthetics of the plot inside the aes() function. Let’s make a scatterplot of the weight of the menu items and how much sugar each has. Keep in mind that a scatterplot typically requires two columns of your spreadsheet to be all numbers, so weight (in ounces) and sugars (in grams) will work fine. We’ll put weight in the x-axis and sugars in the y-axis, separated by a comma. Note that the names of these in R must match exactly how they look in the column headers (including capitalization).
ggplot(data=menu) +
geom_point(mapping = aes(x=Oz, y=Sugars))
Boom. We just created a visualization. With just two short lines of code, we created a decent scatterplot. Note that the gray background is still there and that we’ve overlaid white grid lines (major and minor ones if you look closely), x- and y-axis labels and tick marks, and of course the points themselves, which are black circles by default. You can change every one of these layers by the way and we’ll get to some of those later.
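Because the plot is built layer by layer, you can also save the blank canvas in a variable and add the points to it afterwards. This is just a sketch of that idea: the names `canvas` and `p` are my own, and `menu_demo` is a hypothetical table with invented values standing in for the real menu data.

```r
library(ggplot2)

# Hypothetical data with invented values.
menu_demo <- data.frame(
  Oz     = c(4.8, 5.7, 12, 16, 21, 32),
  Sugars = c(3, 3, 39, 0, 58, 86)
)

canvas <- ggplot(data = menu_demo)   # the blank gray rectangle, nothing drawn yet
p <- canvas + geom_point(mapping = aes(x = Oz, y = Sugars))   # canvas plus one layer

print(canvas)   # still blank
print(p)        # the scatterplot
```

Nothing is drawn until the object is printed, so `canvas` stays blank even after `p` has been created from it.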
We can modify this chart by adding more aesthetics. Sometimes it’s useful to have the colors vary depending on some additional variable. In this dataset, we have a column that refers to each category of menu item. To add that to the plot, we add it as a third aesthetic:
ggplot(data=menu) +
geom_point(mapping = aes(x=Oz, y=Sugars, color=Category))
What ggplot has automatically done is determine what all the categories are in your dataset and assign them each a color. The categories are in alphabetical order by default and the colors are equidistant shades from red to purple. You can modify both the order and the specific colors, but we’ll get to that next week.
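If you want to predict the legend order before plotting, one quick check is to look at the sorted categories yourself. The sketch below uses a handful of category names from the summary above, in a deliberately scrambled order; for a text column, the order factor() gives its levels should match the order of the legend.

```r
# A few category names, deliberately not in alphabetical order.
category <- c("Coffee & Tea", "Breakfast", "Smoothies & Shakes", "Beverages")

# factor() sorts the unique values alphabetically to create its levels.
category_levels <- levels(factor(category))
print(category_levels)
```

If your column is already a factor with levels in some other order, ggplot2 follows that order instead, which is the hook for the reordering we'll do next week.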
There are other aesthetics that you could add as well, either as additional elements to your plot or combined with what you have. For example, we could vary the size of the points depending on how much fat there is in each menu item.
ggplot(data=menu) +
geom_point(mapping = aes(x=Oz, y=Sugars, color=Category, size=Fat))
If we wanted to really emphasize the differences in category, we could add shape as well.
ggplot(data=menu) +
geom_point(mapping = aes(x=Oz, y=Sugars, color=Category, size=Fat, shape=Category))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 9.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 47 rows containing missing values (geom_point).
This is where things get crazy, and ggplot lets you know that with a warning message. First off, shapes are only good for categories where there are just a few options: the default set handles at most six, and this dataset has nine categories. The second warning follows from the first: points in the categories left without a shape cannot be drawn, so they are dropped. So though it is theoretically possible to include all these aesthetics into your plot, that certainly doesn’t mean you should. Right now we have four different pieces of information plotted on a single plot: Ounces is on the left-to-right dimension, Sugars is top-to-bottom, category is color and shape, and fat is size. That’s a lot of things for a human to process. It’s to the point that it’s hard to draw any meaningful conclusions from the plot.
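The warning suggests specifying shapes manually if you must have them. For the curious, here is a minimal sketch of what that could look like using scale_shape_manual(). The data is hypothetical (nine made-up groups with invented values), and the choice of shape codes 1 through 9 is arbitrary. This shows that it can be done, not that it makes the plot any easier to read.

```r
library(ggplot2)

# Hypothetical data: nine made-up groups, one point each.
shape_demo <- data.frame(
  Group  = LETTERS[1:9],
  Oz     = 1:9,
  Sugars = c(5, 12, 8, 20, 15, 30, 25, 40, 35)
)

p <- ggplot(data = shape_demo) +
  geom_point(mapping = aes(x = Oz, y = Sugars, shape = Group)) +
  scale_shape_manual(values = 1:9)   # one shape code per group, in legend order

print(p)
```

With a shape supplied for every group, no points are dropped, but nine different shapes are still hard to tell apart at a glance.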
This gets into the idea of why we make visualizations in the first place. We want to display a large amount of data in a way that makes it easy to digest and see patterns. By throwing all these variables into one plot, we don’t accomplish this purpose. It would be better to make separate plots that are easy to understand than one mega plot with everything.
Modify/remove the color, size, and/or shape of the above plot to get different cleaner plots. To learn more about variable types and how they work as additional aesthetics in ggplot2, try to answer the following questions.
What kinds of colors do you see if you set a categorical variable to be the color? What about a continuous variable?
What kind of variable is good for the size aesthetic?
The shape aesthetic only takes one kind of variable type. Which is it?
In these two plots, we see that if you set a categorical variable (like Category) as the color, it will produce a rainbow theme with the colors maximally divergent from each other. But if you set a continuous variable (like Fat) as the color, it will produce a gradient color scheme going from dark blue to light blue by default.
ggplot(data=menu) +
geom_point(mapping = aes(x = Oz, y = Sugars, color = Category))
ggplot(data=menu) +
geom_point(mapping = aes(x = Oz, y = Sugars, color = Fat))
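If you'd like to confirm this without squinting at the legend, you can ask ggplot2 which colors it actually assigned. The sketch below uses layer_data(), which returns the computed values behind a layer, on a hypothetical six-row table, `menu_demo`, with invented values.

```r
library(ggplot2)

# Hypothetical data with invented values: three categories, six distinct Fat values.
menu_demo <- data.frame(
  Category = c("Breakfast", "Breakfast", "Beverages",
               "Beverages", "Coffee & Tea", "Coffee & Tea"),
  Oz       = c(4.8, 5.7, 12, 16, 21, 32),
  Sugars   = c(3, 3, 39, 0, 58, 86),
  Fat      = c(13, 23, 0, 5, 31, 40)
)

p_cat <- ggplot(data = menu_demo) +
  geom_point(mapping = aes(x = Oz, y = Sugars, color = Category))
p_fat <- ggplot(data = menu_demo) +
  geom_point(mapping = aes(x = Oz, y = Sugars, color = Fat))

# The colors assigned to each point, as hex codes.
cat_colours <- unique(layer_data(p_cat)$colour)
fat_colours <- unique(layer_data(p_fat)$colour)

length(cat_colours)   # one color per category
length(fat_colours)   # a different shade for each distinct Fat value
```

The categorical plot reuses one color per category, while the continuous plot gives each distinct value its own spot on the gradient.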