This handout accompanies the workshop given on August 28, 2019 at UGA’s DigiLab in the Main Library. There is substantial overlap with my previous workshops on ggplot2. Please visit joeystanley.com/r for the latest materials.


1 Introduction

This is the second of three workshops in the Data Visualization series devoted to ggplot2. In the previous workshop, we looked at the basics of data visualization and data types and introduced the library ggplot2. Specifically, we looked at scatterplots and how we can plot shapes, color, size, and text. We looked at plotting one variable, whether it be categorical with a barplot or continuous with a histogram. Finally, we looked at boxplots and violin plots, and started to show how to overlay multiple plots in one image.

This workshop will focus less on the different kinds of plots and instead will show how you can modify things to suit your visual preferences. Specifically, we’ll take a closer look at modifying colors, reordering and renaming categorical variables, adding titles, modifying axes, and making changes to legends. It sounds like a lot, but it shouldn’t be too bad. After reading through this workshop, you’ll go from the basic default plots to something you might want to include in a presentation or paper.

Finally, the next workshop will dive into more advanced topics and look at how to change the overall “theme” of your plot, including how to add custom themes to match your PowerPoint slides.

Note: I wasn’t able to include everything I wanted to in this workshop. For the sake of simplicity and to ensure the workshop stayed under an hour, I have cut some of the depth in some of these topics. Instead, I have moved this discussion to the supplemental handout. There is no in-person workshop planned in the foreseeable future that covers these topics, but I wanted to make sure you had access to the material.

1.1 Today’s dataset

Last week we looked at McDonald’s menu items. Today, we’ll look at another dataset made available through Kaggle.com that contains nutritional information on 80 different kinds of cereal. I’ve removed some of the columns to simplify the dataset and made it available on my website, so you can read it in straight from there.

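# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which is what the summary below shows
cereal <- read.csv("http://www.joeystanley.com/data/cereal.csv", stringsAsFactors = TRUE)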
summary(cereal)
##                         name    mfr    type       shelf      
##  100% Bran                : 1   G:22   C:74   Min.   :1.000  
##  100% Natural Bran        : 1   K:23   H: 1   1st Qu.:1.500  
##  All-Bran                 : 1   N: 6          Median :2.000  
##  All-Bran with Extra Fiber: 1   P: 9          Mean   :2.227  
##  Almond Delight           : 1   Q: 7          3rd Qu.:3.000  
##  Apple Cinnamon Cheerios  : 1   R: 8          Max.   :3.000  
##  (Other)                  :69                                
##      weight          cups            rating         calories    
##  Min.   :0.50   Min.   :0.2500   Min.   :18.04   Min.   : 50.0  
##  1st Qu.:1.00   1st Qu.:0.6700   1st Qu.:32.69   1st Qu.:100.0  
##  Median :1.00   Median :0.7500   Median :40.11   Median :110.0  
##  Mean   :1.03   Mean   :0.8207   Mean   :42.39   Mean   :107.1  
##  3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:50.28   3rd Qu.:110.0  
##  Max.   :1.50   Max.   :1.5000   Max.   :93.70   Max.   :160.0  
##                                                                 
##      sugars         protein           fat          sodium     
##  Min.   : 0.00   Min.   :1.000   Min.   :0.0   Min.   :  0.0  
##  1st Qu.: 3.00   1st Qu.:2.000   1st Qu.:0.0   1st Qu.:135.0  
##  Median : 7.00   Median :2.000   Median :1.0   Median :180.0  
##  Mean   : 7.08   Mean   :2.493   Mean   :1.0   Mean   :163.9  
##  3rd Qu.:11.00   3rd Qu.:3.000   3rd Qu.:1.5   3rd Qu.:215.0  
##  Max.   :15.00   Max.   :6.000   Max.   :5.0   Max.   :320.0  
##                                                               
##      fiber       
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 2.000  
##  Mean   : 2.173  
##  3rd Qu.: 3.000  
##  Max.   :14.000  
## 

As you can see, we have a relatively simple dataset with a few categorical variables and mostly continuous variables. We’ll use this for our plots today. To get us started, let’s see if the amount of sugar correlates with the cereal’s rating.

library(ggplot2)
ggplot(cereal, aes(sugars, rating)) + 
    geom_point()

Surprisingly, the general trend is that the more sugar a cereal has, the lower the rating is. Another way of looking at it is that cereals with zero sugar have the highest ratings, those with between 1 and 7 grams per serving have medium ratings, and those with 8 or more have low ratings. Correlation is not causation, but the trend is interesting to see.

2 Titles and axes

The first thing we might want to do with a plot is to add a title or change the axis labels. Luckily, this is pretty straightforward.

For a title, all you need to do is add a layer to your plot using the ggtitle function, and then put the title you want to see in quotes.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point() + 
    ggtitle("Cereal rating and amount of sugar per serving")

We can also modify the axis labels of our plot by adding the xlab or ylab layers and putting, in quotes, what you want each label to say.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point() + 
    ggtitle("Cereal rating and amount of sugar per serving") + 
    xlab("Sugar per Serving") + 
    ylab("Average Rating")

Why would you want to change the axes? Here are the kinds of changes I’ve had to do in the past that you might need to do as well:

  1. When the columns of your data frame are something abbreviated, you can put the full name so the plot is easier to read. So, you can change "mpg" to "miles per gallon".

  2. You may want to add a unit of measurement, so that "height" can be changed to "height (in feet)".

  3. Sometimes your column names are super long. An actual column name I’ve seen is "education_level_by_number_of_years", which could easily be shortened to "education (years)".

  4. Often, all you’ll need to do is change from lowercase to uppercase or vice versa ("sugars" to "Sugars").

  5. R doesn’t like spaces in column names, but you’ll probably want them in your plots, so you can change "serving_size" to "serving size".

With xlab and ylab, you can make these changes so that they’re reflected only in your plot without having to bother renaming parts of your data frame. In other words, these are superficial changes only and don’t actually affect the underlying data object.
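As a compact alternative, ggplot2's labs function can set the title and both axis labels in one layer. The sketch below uses a tiny made-up data frame (called toy here, not the real cereal data) so that it runs without a download.

```r
library(ggplot2)

# Hypothetical stand-in for the cereal data, just for illustration
toy <- data.frame(sugars = c(0, 3, 6, 9, 12),
                  rating = c(68, 55, 42, 36, 30))

# labs() sets the title and both axis labels in a single layer
p <- ggplot(toy, aes(sugars, rating)) + 
    geom_point() + 
    labs(title = "Cereal rating and amount of sugar per serving",
         x = "Sugar per Serving",
         y = "Average Rating")
p
```

Whether you prefer ggtitle, xlab, and ylab or a single labs call is a matter of taste; the resulting plot is the same.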

Note: For a more detailed discussion about titles, axes, and other labels in your plots, see the supplement to this workshop.

3 Colors

Since colors are perhaps one of the most visible aspects of a plot other than the data itself, we’ll spend a bit of time to make sure you’ve got just the right colors in your plot.

Note: There will be an entire workshop devoted to the use of color in data visualization generally which will discuss more advanced topics related to color theory and perception. See joeystanley.com/r for the latest materials.

3.1 On categorical data

In the previous workshop, we saw two ways to color points. We can change all the points at once by adding the color argument in geom_point itself, or we can put color within the aes() function to get one color per manufacturer.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(color = "darkblue")

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = mfr))

For now, let’s overlook the fact that it’s not clear what the abbreviations stand for. Later in this workshop, we’ll see how to rename the categories and modify the legend to make it clearer.

So these default colors are good for some purposes. They’re evenly spaced around the color wheel so that they’re maximally distinct. But for lots of reasons, we may want to change them. For example, we might not be satisfied with the default color scheme and want to supply our own. We can do so with the function scale_color_manual and, as its values argument, provide a vector of colors. Note that the colors are assigned to the categories in the order you list them, so the first color goes to the first category in the legend (the top) and the last color goes to the last one (the bottom). Here’s an example of a not-so-good color scheme:

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(values = c("yellow", "blue", "green", "red", "orange", "purple"))

Note that the keywords used there ("red", "blue", etc.) are shortcuts that R has provided. To see a full list of the built-in color shortcuts, see this extremely useful document.
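You can also look up the built-in color names from within R itself. This short sketch uses only base R functions (colors, grep, and col2rgb); the variable names all_colors and blues are just for illustration.

```r
# colors() returns every color name that base R recognizes
all_colors <- colors()
head(all_colors)

# Search for names containing a word, e.g. every built-in shade of "blue"
blues <- grep("blue", all_colors, value = TRUE)
head(blues)

# col2rgb() shows the red, green, and blue values behind a name
col2rgb("darkblue")
```

Anywhere ggplot2 accepts a color name, it should also accept a hexadecimal code such as "#1B9E77", which is handy when you copy colors from a website.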

Being able to modify colors is useful not only to change all the colors, but it’s also good for highlighting a single category. Let’s switch to a barplot that lists the number of cereal items per manufacturer:

ggplot(cereal, aes(mfr)) + 
    geom_bar() 

If you’re familiar with cereal brands, you might deduce that “K” stands for “Kellogg’s”. We can highlight this manufacturer by making its bar red and all the others dark gray. The way this is done is by simply repeating the color names in the vector passed to scale_color_manual:

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(color = mfr)) + 
    scale_color_manual(values = c("grey25", "red", "grey25", "grey25", "grey25", "grey25"))

Oops! What happened here? Keep in mind that in the previous workshop we learned that to color points in a scatterplot, you need to use color, but for barplots you need to use fill. Consequently, we need to use the related function, scale_fill_manual to accomplish this task:

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr)) + 
    scale_fill_manual(values = c("grey25", "red", "grey25", "grey25", "grey25", "grey25"))

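Of course, if you don’t like the way this looks, you can still set the color argument (as well as the size of the outline and the width of the bars) and everything will behave as you expect. (In ggplot2 3.4.0 and later, the outline thickness is called linewidth; size still works there but produces a deprecation warning.)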

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "black", size = 2, width = 0.75) + 
    scale_fill_manual(values = c("grey25", "red", "grey25", "grey25", "grey25", "grey25"))

Being able to highlight a single column is a very effective way of using your visual to tell a story. It draws viewers’ attention exactly where you want it to go, and it minimizes other irrelevant or distracting aspects of the plot. Learning how to highlight individual categories is a great skill to have.

Note: For a more elegant solution to highlighting individual columns in a barplot, particularly if you have lots of columns, see the supplement to this workshop.
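One possible way to avoid typing the same color over and over (not necessarily the approach taken in the supplement) is to build the vector of colors from the factor levels. This is a minimal sketch on a made-up data frame called toy; the name highlight is also just for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the cereal data
toy <- data.frame(mfr = factor(c("G", "G", "K", "K", "K", "N", "P")))

# One color per level: red for "K", dark gray for everything else
highlight <- ifelse(levels(toy$mfr) == "K", "red", "grey25")

p <- ggplot(toy, aes(mfr)) + 
    geom_bar(aes(fill = mfr)) + 
    scale_fill_manual(values = highlight)
p
```

Because the vector is computed from levels(toy$mfr), it always has the right length, no matter how many categories the data has.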

Now you try!

The challenge

  1. Try putting in different variables for the scatterplot and see what kinds of things you discover about cereal. When you find something interesting, perhaps about a single company, highlight just the dots for that manufacturer. Feel free to modify other aspects of the plot as you see fit.

  2. Just as you can manually change the colors using scale_color_manual, you can also change the size of the points using scale_size_manual. In fact, you can change any of the aspects we’ve seen so far using scale_*_manual where the * is the part you’re trying to change. To use these, you have to map those properties in the aes() function because you’re overriding the defaults. In other words, if you want to highlight a particular manufacturer’s dot size, you have to have size = mfr in geom_point. Try making a barplot that uses some of these additional overriding functions.

The solution

Here’s something I found interesting. There’s a slight trend such that the more protein a cereal has, the higher its rating is. But it seems like Nabisco cereals are at the top of the pack while General Mills are lower.

ggplot(cereal, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(values = c("blue3", "darkgrey", "forestgreen", 
                                  "darkgrey", "darkgrey", "darkgrey"))

Of course, this may just be that Nabisco cereals have higher ratings than General Mills across the board. You may have to do some more plots to see for sure.

For the barplot, I went ahead and modified the fill, outline color, and outline size of the bars corresponding to Kellogg’s, and I added General Mills too for fun. I wasn’t able to modify the width manually, so I kept that the same. Also, I set the outline width to zero to remove it entirely on the other four.

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr, color = mfr, size = mfr), width = 0.75) + 
    scale_fill_manual(values = c("blue3", "red", "grey25", 
                                 "grey25", "grey25", "grey25")) + 
    scale_color_manual(values = c("darkblue", "firebrick4", "black", 
                                  "black", "black", "black")) + 
    scale_size_manual(values = c(2, 2, 0, 0, 0, 0))

Using these scale_*_manual functions, it becomes easy to change many aspects of your plot. You can do this to change color schemes, or highlight particular points.

3.2 On continuous data

What we’ve seen so far is how to change colors for categorical variables. When you map color to a continuous variable instead, ggplot2 uses a dark blue to light blue color scheme by default.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = rating))

If you’re not a fan of that, the way we change this is similar to how we changed the categorical variables. Since we’re working with numerical data, we use scale_color_gradient instead. Here, we can specify what the color for the lowest and highest values should be. R will automatically fill in the gaps and create a nice color gradient for you.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = rating)) + 
    scale_color_gradient(low = "red", high = "goldenrod2")

So it’s easy to use whatever colors you want. Personally though, I sometimes have a hard time choosing good colors, because the intermediate colors in the gradient are sometimes not what I expected. (Try "red" to "blue" in the previous plot: the result is not great.) For this reason, I usually resort to other people’s color palettes to make my plots look good.
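If the colors in the middle of a two-color gradient look muddy, one option is to pick the middle color yourself. ggplot2's scale_color_gradient2 takes low, mid, and high colors plus a midpoint value. The data frame toy and the particular colors below are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the cereal data
toy <- data.frame(sugars = c(0, 3, 6, 9, 12, 15),
                  rating = c(70, 58, 45, 40, 33, 25))

# Values near the midpoint are drawn in the mid color,
# so the gradient passes through a color you chose
p <- ggplot(toy, aes(sugars, rating)) + 
    geom_point(aes(color = rating), size = 3) + 
    scale_color_gradient2(low = "red", mid = "grey80", high = "blue",
                          midpoint = median(toy$rating))
p
```

Here the midpoint is set to the median rating, but any value that is meaningful for your data (zero, a mean, a threshold) would work the same way.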

3.3 Other color palettes

When I create my own color schemes, the colors I try are usually too harsh. Fortunately, smart people (particularly cartographers) have developed color schemes that are easy on the eyes. Some of them are good for printing black-and-white and are color-blind friendly as well. I’ll discuss my three favorite ways to get good-looking colors: scale_color_brewer, scale_color_ptol, and scale_color_scico.

3.3.1 Color Brewer

The first one is the color brewer. These palettes were designed by mapmakers and are intended for map data. But they still look good even on other kinds of plots. You can see these schemes by going to colorbrewer2.org.

Luckily for us, ggplot2 comes with built-in functions to work with these palettes, scale_color_brewer and scale_fill_brewer. If you’re working with categorical variables, you’ll need to specify type = "qual" (for qualitative). You can use the default palette ("Accent"), or you can try some of the other ones (such as "Dark2", "Pastel1", "Pastel2", "Set1", "Set2", or "Set3"). These colors tend to be a bit lighter, so they’ll pop out better once we learn how to make the background white instead of the default gray.

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_brewer(type = "qual", palette = "Pastel1")

For continuous variables, the process is similar, but you need to use scale_color_distiller or scale_fill_distiller instead. You’ll need to specify type = "seq" (for sequential) and use one of the many palettes they have available. The multi-hue options are "BuGn", "BuPu", "GnBu", "OrRd", "PuBu", "PuBuGn", "PuRd", "RdPu", "YlGn", "YlGnBu", "YlOrBr", and "YlOrRd" and the single-hue options are "Blues", "Greens", "Greys", "Oranges", "Purples", and "Reds".

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = rating)) + 
    scale_color_distiller(type = "seq", palette = "YlGnBu")

The downside to using the Color Brewer colors is that it’s a bit harder to modify them manually once you use them. It’s possible, but it’ll probably involve going to their website, copying the hexadecimal color codes, and passing them into scale_color_manual.
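As an alternative to copying codes from the website, the RColorBrewer package (which is normally installed alongside ggplot2 as a dependency) can hand you the hex codes directly through its brewer.pal function. This is a rough sketch; the data frame toy and the choice to swap the second color are made up for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# Six hex codes from the "Pastel1" palette
pastel <- brewer.pal(6, "Pastel1")
pastel

# Swap one color by hand, then pass the vector to scale_fill_manual
pastel[2] <- "red"

# Hypothetical stand-in for the cereal data
toy <- data.frame(mfr = factor(c("G", "K", "K", "N", "P", "Q", "R")))

p <- ggplot(toy, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_manual(values = pastel)
p
```

This keeps the rest of the Color Brewer palette intact while letting you override individual colors.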

3.3.2 Scico

Another set of palettes I like is from the scico package (which stands for “Scientific Color Maps”) by Thomas Lin Pedersen and Fabio Crameri. Many color themes are not ideal because human eyes perceive relative distance between colors differently than computers do, especially when you consider color blindness. This blog post summarizes what it can do.

Because this is a separate package, you’ll need to install (install.packages("scico")) and load it (library(scico)), but otherwise it’s pretty easy to implement. To see the list of palettes, you can use the scico_palette_show function.

library(scico)
scico_palette_show()

The ones I personally like the most (and have used in the past) are Berlin and Cork for diverging scales and Oslo for continuous scales.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = rating)) + 
    scale_color_scico(palette = "oslo")

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = rating)) + 
    scale_color_scico(palette = "berlin")

3.3.3 ggthemes

In addition to the Color Brewer and Scico palettes, there’s a whole bunch of additional options in the ggthemes package. Let’s download it and start exploring some of them.

install.packages("ggthemes")
library(ggthemes)

One of the many smart people out there who have thought about colors more than I have is Paul Tol. You can see his philosophy and the custom color palettes that he’s created here. His colors are carefully chosen so that they are color-blind-friendly, print- and photocopy-friendly, and generally go well together. The colors from my dissertation all come from Paul Tol’s colors.

With the ggthemes package, you can use these colors by using scale_*_ptol:

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_ptol()

In addition to Paul Tol’s colors, the ggthemes package comes with other palettes as well. Here are just some of the other ones you might want to try:

  • scale_color_few: Colors from Stephen Few’s Practical Rules for Using Color in Charts.

  • scale_color_pander is particularly useful for making sure your colors are colorblind friendly. See Color Universal Design (CUD) by Masataka Okabe and Kei Ito for more information.

  • scale_color_fivethirtyeight: The website fivethirtyeight.com has its own colors. This function mimics those.

  • scale_color_stata: If you came over to R from Stata and you really like how those plots look, you can use the same colors they do.

  • scale_color_wsj: This uses some of the color palettes that the Wall Street Journal uses.

You can see the complete list of the themes included in this package, as well as examples of when they’re used, on the package’s documentation page.

3.3.4 Wes Anderson

Are the standard themes too mainstream for you? Are you a fan of the artistic style of Wes Anderson? You’re in luck! A whole set of color palettes has been derived from some of Wes Anderson’s work, and data scientist Karthik Ram has wrapped them up into an R package, wesanderson. It’s not included in ggthemes, so you’ll have to download it separately. The package is hosted on GitHub, so you’ll need to do a little extra work to get it installed:

install.packages("devtools")
devtools::install_github("karthik/wesanderson")

After that, you can use it like normal:

library(wesanderson)

Now that you’ve got that, you have the wes_palette at your disposal:

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_manual(values = wes_palette("IsleofDogs1"))

See Ram’s description page for the full list of palettes. Be aware that most of them have relatively few colors (5 or fewer) so they’ll only work if you also have relatively few categories to color.
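If a palette you like has fewer colors than you have categories, one workaround is to interpolate extra colors between the ones it provides. The sketch below uses base R's colorRampPalette on a made-up four-color vector (short_palette, with arbitrary hex codes that are not from any package).

```r
# A short, hypothetical four-color palette (arbitrary hex codes)
short_palette <- c("#264653", "#2A9D8F", "#E9C46A", "#E76F51")

# colorRampPalette() returns a function that interpolates n colors
stretch <- colorRampPalette(short_palette)
six_colors <- stretch(6)
six_colors
```

The resulting vector can be passed to scale_fill_manual(values = six_colors). Keep in mind that interpolated colors are closer together than the originals, so they may be harder to tell apart. As far as I know, wes_palette offers similar interpolation through its type = "continuous" argument.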

Your turn!

The challenge

Try using a categorical color scheme in scale_color_distiller and a continuous theme in scale_color_brewer and see what happens.

The solution

Fortunately, the folks at Color Brewer have made it so the continuous color schemes look great, even on categorical data. With six categories, it’s spread a little thin within the color scheme, but they’re still distinct.

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_brewer(type = "qual", palette = "Greys")

Going the other way, ggplot2 is smart and will take the discrete colors of a categorical theme and fill in the gaps to create a continuous color scheme.

ggplot(cereal, aes(sugars, rating)) + 
    geom_point(aes(color = rating)) + 
    scale_color_distiller(type = "seq", palette = "Set2")

This is what they do on weather maps, by the way. As it goes from cold to hot, the colors go from white to blue to green to red. Here it’s a bit unnecessary, but it might be useful for you and your data.


3.4 Final remarks on color

So that’s how you can modify colors in your plots. Colors are one of the most important parts of your plot, other than the data itself. A good use of color schemes can really make a plot look great so it’s worth the time to learn to use them well. As you can tell, I’ve got a lot to say about color, so be sure to come to the workshop on colors later in the semester.

4 Renaming and reordering

Reordering things is a pretty simple process, as we’ll see in this section, but renaming is a bit trickier. The way to rename categories is by modifying the dataframe itself and then feeding this new dataframe into ggplot. For this, we’ll use some functions from the forcats package, which is part of the tidyverse suite of packages.

If you haven’t installed it already, you’ll need to install and load forcats. (If you’ve already installed tidyverse, then that’s already been taken care of.)

install.packages("forcats") # If you haven't already

We can use fct_recode to change all those abbreviations. Note that the old name is on the right and the new name is on the left:

# Load the package
library(forcats)

# Make a copy again
cereal_renamed <- cereal

# Modify them like this.
cereal_renamed$mfr <- fct_recode(cereal_renamed$mfr,
                                 "Kellogg's" = "K",
                                 "General Mills" = "G",
                                 "Post" = "P",
                                 "Quaker Oats" = "Q",
                                 "Ralston Purina" = "R",
                                 "Nabisco" = "N")

# Plot it
ggplot(cereal_renamed, aes(mfr)) + 
    geom_bar()

Now we have different names in the plot, which ultimately makes it much easier to read and interpret.
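To double-check that a rename did what you expected, it can help to compare the factor levels before and after. This is a small sketch on a made-up factor (mfr, with only three manufacturers) rather than the full dataset.

```r
library(forcats)

# A tiny hypothetical factor with one-letter abbreviations
mfr <- factor(c("K", "G", "K", "N"))
levels(mfr)

# New name on the left, old name on the right
renamed <- fct_recode(mfr,
                      "Kellogg's" = "K",
                      "General Mills" = "G",
                      "Nabisco" = "N")

# The levels keep their original order, just with the new names
levels(renamed)
table(renamed)
```

Note that fct_recode keeps the levels in their original order, so a rename by itself does not change the order of the bars or the legend.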

What you may want to do at this point is change the order. By default, R will order your categorical variables alphabetically. This is a good default, but you might want to change that in your own plot. What we can do is modify the mfr column so that it’s a factor with levels in the order that you specify. If you want to do it by hand, it’s not too much work with only six companies:

cereal_ordered <- cereal_renamed
cereal_ordered$mfr <- factor(cereal_ordered$mfr, 
                             levels = c("Kellogg's", "General Mills", "Post", 
                                        "Quaker Oats", "Ralston Purina", "Nabisco"))

An easier way would be to use the fct_infreq function from forcats. This puts the manufacturers in order of frequency. This is especially handy if you have lots of categories to plot or you just don’t want to have to type them all out. (Obviously this option won’t work if you need a specific order that can’t be computed.)

cereal_ordered <- cereal_renamed
cereal_ordered$mfr <- fct_infreq(cereal_ordered$mfr)

Either way, when we needed to rename or reorder something, we took care of it by modifying the dataframe itself and then sending it off to ggplot. We can now plot our now-modified dataset and the columns will show up in the right order.

ggplot(cereal_ordered, aes(mfr)) + 
    geom_bar() 

That’s really all there is to it. If you don’t like the order of things, it’s pretty easy to change it once you’ve got the code.
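A few other forcats functions are handy for reordering. The sketch below, on a made-up factor, shows fct_infreq alongside fct_rev (which reverses the order) and fct_relevel (which moves chosen levels to the front).

```r
library(forcats)

# A tiny hypothetical factor: three K's, two G's, one N
mfr <- factor(c("K", "K", "K", "G", "G", "N"))

by_freq  <- fct_infreq(mfr)           # most frequent level first
reversed <- fct_rev(fct_infreq(mfr))  # least frequent level first
n_first  <- fct_relevel(mfr, "N")     # move "N" to the front, keep the rest

levels(by_freq)
levels(reversed)
levels(n_first)
```

Each of these returns a new factor, so you would assign the result back to the column (as with cereal_ordered$mfr above) before plotting.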

5 Legends

A critical part of data visualization is the legend. Sometimes the legend is the key to understanding the plot; other times it’s completely unnecessary. We’ll look at some examples of each and see how we can improve the visualization.

5.1 Removing the legend

Let’s take a bar plot of the manufacturers and color it using Color Brewer.

ggplot(cereal_ordered, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_brewer(type = "qual", palette = "Set2")

Look closely: what purpose does the legend serve here? All the information from the legend itself is already contained in the plot. The legend adds nothing to the visual. Fortunately, it’s easy to remove the legend using the theme function. theme is actually a monster of a function with dozens and dozens of arguments because it’s the key to modifying pretty much everything on the plot. We’ll get to that in the next workshop (or look at the supplementary handout). For now, if we set the legend.position argument to "none", it’ll remove the legend.

ggplot(cereal_ordered, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_brewer(type = "qual", palette = "Set2") + 
    theme(legend.position = "none")

This not only removes the legend, but it actually stretches out the plot a little bit to fill the horizontal space.

Alternatively, instead of using "none", you can also change the legend’s location by using "top", "bottom", or "left". But for this plot, it’s probably best to just remove it entirely. (In fact, we can probably remove the colors since they don’t actually serve a purpose, but that’s up to you.)

5.2 Modifying the legend order

Previously, we saw how to modify the order that the manufacturers appeared in the barplot. When we make changes to the underlying data, the legend automatically updates. We can actually modify the legend independently of the data though, and in some cases, it’s quite useful. Let’s take a scatterplot I did earlier that highlighted General Mills and Nabisco. This time we’ll use the cereal_renamed object so we can get the actual names.

ggplot(cereal_renamed, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(values = c("blue3", "darkgrey", "forestgreen", 
                                  "darkgrey", "darkgrey", "darkgrey"))

Here’s a good example of when we might want to reorder the legend. One option is to use the technique demonstrated in the Renaming and reordering section above. If we go that route, we’ll also need to modify the order of the colors in scale_color_manual since now the forest green one is first. Here’s how that’s done:

cereal_ordered2 <- cereal_ordered
cereal_ordered2$mfr <- factor(cereal_ordered2$mfr, 
                             levels = c("Nabisco", "General Mills", "Kellogg's", "Post", 
                                        "Quaker Oats", "Ralston Purina"))
ggplot(cereal_ordered2, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(values = c("forestgreen", "blue3", "darkgrey",
                                  "darkgrey", "darkgrey", "darkgrey"))

The problem with this option is that we either have to 1) modify the original dataset (cereal), which is sometimes something you don’t want to do, or 2) keep track of several very similar objects (cereal_renamed, cereal_ordered, and now cereal_ordered2). Neither is particularly elegant.

A slightly better alternative is to add another argument to scale_color_manual, breaks, and simply list the order we want to see. The tricky part is that the colors in values still have to line up with the original order of the manufacturers. Specifically, Nabisco comes third in the original order, so its "forestgreen" color comes third in values; however, I want it to show up first in the legend, so I’ll put "Nabisco" first in breaks. The reason is that breaks makes a superficial change to the legend, but the order stays the same under the hood. To make the pairing explicit, each color in values is named here: ggplot2 3.3.0 and later matches unnamed values to the order of breaks instead, which would hand "forestgreen" to the wrong manufacturer.

ggplot(cereal_renamed, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    # Named values keep each color tied to its manufacturer
    scale_color_manual(values = c("General Mills" = "blue3", "Kellogg's" = "darkgrey",
                                  "Nabisco" = "forestgreen", "Post" = "darkgrey",
                                  "Quaker Oats" = "darkgrey", "Ralston Purina" = "darkgrey"),
                       breaks = c("Nabisco", "General Mills", "Kellogg's", "Post", 
                                  "Quaker Oats", "Ralston Purina"))

This is admittedly a bit annoying, but you only have to worry about it once and then it’ll always be the same after that in your plots. Moral of the story, there’s no simple solution to reordering how you want.
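If you want to confirm which color each category actually received, ggplot2's layer_data function returns the computed values for a layer. The sketch below uses a made-up three-row data frame (toy) and names each color in values, which, as far as I can tell, keeps the pairing stable across ggplot2 versions while breaks changes only the legend order.

```r
library(ggplot2)

# Hypothetical stand-in for the renamed cereal data
toy <- data.frame(protein = c(1, 2, 3),
                  rating  = c(30, 45, 60),
                  mfr     = factor(c("General Mills", "Kellogg's", "Nabisco")))

p <- ggplot(toy, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(values = c("General Mills" = "blue3",
                                  "Kellogg's"     = "darkgrey",
                                  "Nabisco"       = "forestgreen"),
                       breaks = c("Nabisco", "General Mills", "Kellogg's"))

# One color per row of toy, in the same order as the rows
point_colors <- layer_data(p)$colour
point_colors
```

The legend lists Nabisco first, but the Nabisco point is still drawn in forest green because the color is attached to the name rather than to a position.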

5.3 Modifying the legend text

One super annoying part about this plot is that we see the abbreviation “mfr” as the title of the legend. That’s the name of the column in our spreadsheet, and it’s handy to have it short because, well, less typing. But in a professional visualization, you’ll probably want to spell it out.

You can modify the original data, but it might just be easier to make a superficial change. This can happen in the scale_color_manual function again, with the argument name.

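ggplot(cereal_renamed, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(name = "Manufacturer",
                       # Named values keep each color tied to its manufacturer,
                       # whatever order breaks puts the legend in
                       values = c("General Mills" = "blue3", "Kellogg's" = "darkgrey",
                                  "Nabisco" = "forestgreen", "Post" = "darkgrey",
                                  "Quaker Oats" = "darkgrey", "Ralston Purina" = "darkgrey"),
                       breaks = c("Nabisco", "General Mills", "Kellogg's", "Post", 
                                  "Quaker Oats", "Ralston Purina"))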

So now we have three arguments in scale_color_manual: name, values, and breaks. The order that they appear makes no difference. But I like to at least put the name at the top since it appears at the top in the final plot.

We can also modify the legend labels. We saw how to change the underlying data above in the Renaming and reordering section. But just as you sometimes want to change the order just for the purpose of a single plot, you can rename things superficially while keeping the underlying data intact. To illustrate this, I’ll use the original cereal dataset, which has the one-letter abbreviations for the manufacturers, and I’ll change them in the plot (and I’ll make them lowercase to show that they’re different). To accomplish this, I add the labels argument to the other three.

# Plot with no changes to the legend (other than color)
ggplot(cereal, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(values = c("blue3", "darkgrey", "forestgreen", 
                                  "darkgrey", "darkgrey", "darkgrey"))

# Plot with all changes in title, order, labels, and colors
ggplot(cereal, aes(protein, rating)) + 
    geom_point(aes(color = mfr)) + 
    scale_color_manual(name = "Manufacturer",
                       breaks = c("N", "G", "K", "P", "Q", "R"),
                       labels = c("nabisco", "general mills", "kellogg's",
                                  "post", "quaker oats", "ralston purina"),
                       # Named values keep each color tied to its manufacturer,
                       # whatever order breaks puts the legend in
                       values = c(G = "blue3", K = "darkgrey", N = "forestgreen", 
                                  P = "darkgrey", Q = "darkgrey", R = "darkgrey"))

Thus, with a little bit of typing, you can modify whatever you want with the legend.

Your turn!

The challenge

Everything about manually changing things in the legend in scale_color_manual also applies to scale_fill_manual and scale_fill_brewer. Try making a barplot with the original cereal dataset but manually modify the name, order, and labels (and colors unless you use Color Brewer). The result should be such that the legend is in your custom order but the bars are still in alphabetical order (with one-letter abbreviations underneath). Not a useful plot, but a useful exercise.

The solution

I decided to use scale_fill_brewer, so I didn’t need to manually set the colors.

ggplot(cereal, aes(mfr)) + 
    geom_bar(aes(fill = mfr), color = "grey25") + 
    scale_fill_brewer(type = "qual", palette = "Pastel2",
                      name = "Manufacturer",
                      breaks = c("P", "Q", "R", "G", "N", "K"),
                      labels = c("post", "quaker oats", "ralston purina", 
                                 "general mills", "nabisco", "kellogg's"))

5.4 Credit to R Cookbook

Much of the material from this section was borrowed from Winston Chang’s http://www.cookbook-r.com, specifically the page on legends in ggplot2. I refer to this page all the time in my own research, and there’s so much more that is covered there that I couldn’t get to here. Be sure to check it out.

6 Final Remarks

I hope this workshop has at least gotten you interested in using ggplot2. I know there’s a lot to learn, but there are patterns in the code, making it somewhat easy to try new things. Visualizing data is a key component of data analysis: sometimes it’s for your own consumption so that you can understand your own data and sometimes it’s for public consumption such as in a presentation or a paper. ggplot2 is a very good tool that should meet most people’s needs, and with it you’ll soon be able to create stunning visualizations with your data.

Now that you have two workshops’ experience, you’ve seen that ggplot2 is a workhorse and can do a lot of things. In this workshop we saw how to change the big things in your plots (titles, axes, colors, names, orders, legends) and how it’s relatively straightforward to make these changes. With these tools under your belt, you’re on your way to making some really professional-looking visualizations.