Introduction

This is the second in a two-part series of Tidyverse workshops. In the last one, which you can see here, we looked at three main topics:

  • Getting data in and out of R

  • Tidying columns (including reordering, renaming, adding, removing, and modifying)

  • Filtering

These are very important things you should be relatively comfortable with, regardless of what you use R for. They’re also relatively easy topics for you to grasp because, for the most part, you probably know how to do them just as well in Excel. In fact, you might be wondering why bother learning R and tidyverse functions when they can just do what Excel does…

Today’s workshop covers some more advanced topics that are a little more difficult to conceptualize and, crucially, are much more difficult to do in Excel. In fact, I don’t know if Excel can even do some of the stuff that we’ll be learning today—and we’ll be able to do it in just one line of code! The three main topics are merging, summarizing, and reshaping your data. It took me a while to get the hang of them, but boy once you do they sure are some sweet skills to have up your sleeve.

As always, before we get into anything, let’s load the tidyverse package.

library(tidyverse)

Data

While I generally never watch sporting events, for some reason I go crazy during the Olympics and watch way more than I should. Given that the Winter Olympics just ended, I thought we could take a look at data from previous Winter Olympics. Today we’ll work with data from all the Winter Olympics from the first one in 1924 until the second most recent one in Sochi in 2014. This data was originally downloaded from Kaggle.com and then I modified it a little bit to make it appropriate for this workshop.

The data is stored in three separate spreadsheets, which you can download directly from my website using the links below.

The first is a spreadsheet of all the events that occurred each year. It contains the following columns:

  1. Year: The year the event took place. Some events have been discontinued and others added, so the list of specific events changes from year to year. For example, Military patrol ended in 1948 while mixed doubles curling was added in 2018.

  2. Sport: The broad terms for the different sports in the games (Biathlon, Bobsleigh, Curling, Ice Hockey, Luge, Skating, and Skiing)

  3. Discipline: A narrower term within each sport. For example, Skeleton and Bobsleigh are disciplines within the sport Bobsleigh, and Figure Skating, Short Track Speed Skating, and Speed Skating are disciplines within the sport Skating.

  4. Event: The name of the specific event that an athlete can medal in. So within the Figure Skating discipline, there are four events: Ice Dancing, Individual, Pairs, and Team.

Unfortunately, with the data that I could find, I don’t have information on mixed events, such as Mixed Curling or Mixed Relay Biathlon, which have both men and women on a team. However, this can mostly be inferred from the gender of the athlete and the name of the event, which tends to have the word mixed in it (Mixed Relay Biathlon).

Let’s go ahead and read that data in so we can get an idea of what we’re looking at.

events <- read_csv("http://joeystanley.com/data/events.csv")
events

The next dataset has information about all the athletes that got at least one medal. This spreadsheet has seven straightforward columns containing the person’s name, country, gender, discipline, event, year, and medal.

athletes <- read_csv("http://joeystanley.com/data/athletes.csv")
athletes

Finally, let’s read in data about the locations each year. This is a spreadsheet that has columns for the year, city, country, and continent where the games happened. Let’s read that in now.

years <- read_csv("http://joeystanley.com/data/years.csv")
years

Okay, so we have three spreadsheets that are all kinda related. So without further ado, let’s get to combining them.

Joining datasets

But first, why bother storing all this data in different spreadsheets in the first place? Why not have a single file that contains all the athlete info, the events, and the locations? The reason is that splitting the data into three spreadsheets reduces the amount of redundant information in each one.

Let’s say we had a spreadsheet of all the athletes’ names, years, and the city where the Olympics were held. We would have 5,770 rows and three columns. But the column with the city would be really redundant: you’d have Sochi there hundreds of times, right next to a 2014. If we split the file into two, with the athlete name and year in one, and a smaller one with just the year and city in the other, we only need to type Sochi once. Yes, we repeat the year, but the number of repeats this way is far fewer than in one monster spreadsheet. In the end, you can still easily recover the original data, and the overall file size of the split spreadsheets is much smaller than the size of a giant one.
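To make the idea concrete, here’s a minimal sketch with made-up data (the names and values are just for illustration). One redundant table repeats the city for every athlete; splitting it stores each city only once, and a join recovers the original:

```r
library(tidyverse)

# Redundant version: the city is retyped for every athlete
wide <- tibble(
  Athlete = c("A", "B", "C", "D"),
  Year    = c(2014, 2014, 2010, 2010),
  City    = c("Sochi", "Sochi", "Vancouver", "Vancouver")
)

# Split version: the city appears exactly once per year
athletes_small <- wide %>% select(Athlete, Year)
years_small    <- wide %>% distinct(Year, City)

# The original table is easily recovered with a join
recovered <- athletes_small %>% left_join(years_small, by = "Year")
```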

Merging datasets might remind you of databases. In fact, they’re essentially the same thing! In a database, you have two or more spreadsheets that are linked in some way, usually by some sort of key identifier. With the various *_join functions in dplyr, you can connect datasets in a way very reminiscent of how you might do it in database software.

left_join and right_join

So let’s start by taking the athlete data and adding the city names. To get an idea of how we might merge them, let’s look at them one more time:

athletes
years

What do these two spreadsheets have in common? Well, they both have a column with the years. Coincidentally, in both spreadsheets, this column is named Year. (Identical names across spreadsheets, while not required, do make this kind of work easier.) So, if we want to add the city name to the athlete data, we can use the left_join function. I’m going to just select the relevant columns right now to make it clearer.

athletes %>%
    select(Athlete, Year) %>%
    left_join(years, by = "Year")

So, what we’ve done is merged the two datasets. Wherever a year was found in the Year column of the athletes dataset, it added the city name from the years dataset. That’s why we had to specify the argument by = "Year". If you’ve ever used lookup tables in Excel, this is essentially what we’re doing.

So why is it called left_join? There’s a more technical explanation in §13.4.4 of R for Data Science, but it essentially means we’re adding information to the dataframe on the left (or, in this case, the dataframe that’s being piped into the function). If there’s a year in the left dataframe (athletes) that is not found in the right one (years), you’ll see an NA in the new City, Country, and Continent columns when they’re combined. However, if there’s a year in the years dataset that is not in the athletes dataframe, no harm done: it’ll simply be excluded from the combined dataset.

You can think of the years dataset as a dictionary, and the Year column in the athletes dataset as the stuff we’re looking up. The cool part is that rather than just retrieving one piece of information, if the year dataset had more columns, all of them would be appended on to the combined one.
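As an aside, the key columns don’t even need to share a name: you can spell out the pairing with a named vector in by. Here’s a quick sketch with made-up data, where the hypothetical lookup table calls its key column year (lowercase):

```r
library(tidyverse)

medals <- tibble(Athlete = c("A", "B"), Year = c(1998, 2014))
lookup <- tibble(year      = c(1998, 2014),
                 City      = c("Nagano", "Sochi"),
                 Continent = c("Asia", "Europe"))

# Year in medals is matched against year in lookup; every other
# lookup column (City, Continent) comes along for the ride
medals %>% left_join(lookup, by = c("Year" = "year"))
```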

Since there’s a left_join, there is also a right_join, which does the logical opposite. We could get the exact same spreadsheet using right_join by moving some things around:

years %>%
    right_join(athletes, by = "Year")

The order of the columns is a little different, but the data is all there. In my own code, I don’t use right_join as much, I guess because conceptually I like left_join better, but it’s completely up to you.

So what happens if you do the opposite? Let’s do a left_join with years on the left and athletes on the right.

years %>%
    left_join(athletes, by = "Year")

Turns out, we get essentially the same thing, just with a couple of rows of NAs at the top! They’re very similar because the years in the two spreadsheets nearly perfectly match each other. Not perfectly, though: the years data has 2018 and 2022 in it, but we have no information about those years in the athletes data.

So what makes left_join and right_join different is how they treat missing data. To illustrate, let’s create subsets that are more drastically different. First, let’s say we only remember the cities for the last seven Winter Olympics:

last_7_cities <- years %>%
    filter(Year >= 1998)
last_7_cities

And then let’s say we only had information on the women who medaled in individual figure skating in the nineties (1992, 1994, 1998):

skaters_90s <- athletes %>%
    filter(Year >= 1990, Year <= 2000, 
           Discipline == "Figure skating", Event == "Individual", 
           Gender == "Women")
skaters_90s

So we have two datasets that overlap partially, but not completely. Let’s see what happens when we join them. First, let’s try to add the city name to the athletes’ information using left_join, where the skaters_90s data comes first:

left_join(skaters_90s, last_7_cities, by = "Year")

If you look through the City column, you’ll notice that Nagano is the only one there, and the rest of the rows have NAs. So it kept all the skaters’ information, and only added the city if it was in the “dictionary.” Now let’s try a right_join:

right_join(skaters_90s, last_7_cities, by = "Year")

So now things look different. Because the last_7_cities dataframe was the “main” one, it kept all the data in it, specifically the years. That includes six years for which there is no athlete data, so in the first six rows we have NAs in all the columns except Year. But then it has all the athletes for 1998, because that was the only year that overlapped between the two.

The moral of the story is that if you’ve got very clean data where the info from one perfectly matches the other, as far as I can tell there’s no substantial difference between left_join and right_join other than the row order (which you can quickly change with arrange). However, if there is a mismatch, the two functions are very different, and you have to think about what you want your result to look like.
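To illustrate that point, here’s a sketch with made-up data showing that, when the keys match perfectly, a left_join and the corresponding right_join line up once you sort the rows with arrange and reorder the columns with select:

```r
library(tidyverse)

x <- tibble(Year = c(2010, 2014), Athlete = c("A", "B"))
y <- tibble(Year = c(2014, 2010), City = c("Sochi", "Vancouver"))

via_left  <- left_join(x, y, by = "Year")
via_right <- right_join(y, x, by = "Year")

# After sorting the rows and lining up the columns, the two agree
via_left  %>% arrange(Year) %>% select(Year, Athlete, City)
via_right %>% arrange(Year) %>% select(Year, Athlete, City)
```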

inner_join and full_join

Okay, so what if there are mismatches between your spreadsheets? (There likely will be.) You know there are going to be some NAs. Is there a way to remove them?

Sure! That’s what inner_join is for! Let’s do exactly what we did above, but using inner_join instead of right_join.

inner_join(skaters_90s, last_7_cities, by = "Year")

Here, all we get are the figure skaters from 1998, because that’s the only year that overlaps between the two datasets. If we do inner_join with the two datasets reversed, as far as I’m aware the result is exactly the same, just with the columns in a different order:

inner_join(last_7_cities, skaters_90s, by = "Year")

You might expect there to be an outer_join function, which would keep just the athletes whose years are not in the last_7_cities dataframe and the years from the last_7_cities dataframe with no athletes. However, such a function does not exist as far as I’m aware, and honestly I can’t think of a case where this might be useful.

However, there is full_join, which will keep everything from both.

full_join(skaters_90s, last_7_cities, by = "Year")

This dataframe has 15 rows because it keeps all the athletes from skaters_90s (and puts NAs for those whose years are not in the last_7_cities data) and the remaining six years that have no representation in the skaters_90s dataframe. I have not needed to use this particular function, but you might find it useful.

Using joins to filter data

There are two more join functions that are pretty slick. They take a sec to wrap your mind around, but once you get them they’re really nice to be aware of.

The first is semi_join. This filters the data such that only rows in skaters_90s that have a match in last_7_cities are kept.

semi_join(skaters_90s, last_7_cities, by = "Year")

This is slightly different from inner_join because the City, Country, and Continent columns are not in the resulting dataframe. So it really is just a filter. Another way to do this is with %in%, if that makes more sense conceptually.

skaters_90s %>%
    filter(Year %in% last_7_cities$Year)

The opposite of this is anti_join. This returns all the rows of the first dataframe that do not have a match in the second. In other words, this shows us the data in the skaters_90s dataframe from 1994 and 1992 because last_7_cities does not have data for 1994 and 1992.

anti_join(skaters_90s, last_7_cities, by = "Year")

This anti_join is super handy. There have been many times when I’ve needed to compare two similar datasets that were each pretty big, but I knew there were a few discrepancies. With anti_join I could isolate those discrepancies with just a single line of code, whereas any other way would have been a lot more work.
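For instance, running anti_join in both directions and stacking the results with bind_rows shows everything that appears in one dataset but not the other. Here’s a sketch with made-up toy data standing in for the two big datasets:

```r
library(tidyverse)

a <- tibble(Year = c(1992, 1994, 1998), Athlete = c("A", "B", "C"))
b <- tibble(Year = c(1998, 2002), City = c("Nagano", "Salt Lake City"))

# Rows of a with no match in b, and rows of b with no match in a;
# bind_rows fills in NAs for columns that exist on only one side
discrepancies <- bind_rows(anti_join(a, b, by = "Year"),
                           anti_join(b, a, by = "Year"))
discrepancies
```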

Your turn!

The challenge

The solution

Summarizing

Working with large datasets is great, but sometimes we want to summarize what’s going on. In this section, we’ll look at the summarize function, especially in conjunction with group_by, which will allow us to create summarized versions of our data.

Let’s start with our athletes dataframe. If we just peek at the first twelve lines, we can already see that some athletes compete in multiple years and/or across different events.

athletes