library(tidyverse)
library(ggthemes)
library(gganimate)
library(mgcv)
library(itsadug)
Last week, I presented some work that Lisa Johnson and I have been working on. We discussed ways that vowel formant trajectories can help us learn more about vowel mergers. What seemed to get the most attention though were the data visuals that I created for the talk.
Just when I thought I couldn’t be more impressed with @joey_stan ’s data viz game, he pulls out these animated vowel trajectories! #LSA2022 pic.twitter.com/ZzKqvOWoR0
— Nicole Holliday (@mixedlinguist) January 7, 2022
Here’s what that image looks like up close:
My friend Andrew Bray recommended I do a tutorial on how to do them:
After @joey_stan teaches us all how to do it, trajectory gifs need to become the norm. #LSA2022
— Andrew Bray (@arbray01) January 7, 2022
I haven’t done a cool tutorial for a while and I thought it might make for a good one. So, in this blog post, I’ll show you my step-by-step process for how I made that animation. It’s a little long because I’ve tried to explain the reasoning for why I did what I did. I guess it just shows how much work goes into a nice visual I suppose. Anyway, hopefully you can make it too with your own data!
Load the data
I’ll start by loading the data that I used in the presentation.
<- read_csv("lsa2022.csv") %>%
heber print()
# A tibble: 8,409 × 9
# Groups: token_id [807]
token_id speaker word allophone percent F1_norm F2_norm yob start_event
<chr> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <lgl>
1 UT008-Jani… UT008-… kneel ZEAL 0 0.464 2.82 1958 TRUE
2 UT008-Jani… UT008-… kneel ZEAL 0.1 0.434 2.88 1958 FALSE
3 UT008-Jani… UT008-… kneel ZEAL 0.2 0.478 2.81 1958 FALSE
4 UT008-Jani… UT008-… kneel ZEAL 0.3 0.524 2.27 1958 FALSE
5 UT008-Jani… UT008-… kneel ZEAL 0.6 0.619 1.11 1958 FALSE
6 UT008-Jani… UT008-… kneel ZEAL 0.8 0.387 0.960 1958 FALSE
7 UT008-Jani… UT008-… kneel ZEAL 0.9 0.392 1.04 1958 FALSE
8 UT008-Jani… UT008-… kneel ZEAL 1 0.631 1.30 1958 FALSE
9 UT008-Jani… UT008-… built GUILT 0.1 0.430 2.39 1958 TRUE
10 UT008-Jani… UT008-… built GUILT 0.2 0.455 2.36 1958 FALSE
# ℹ 8,399 more rows
This data has already been processed a little bit. Formants were extracted using FastTrack, and from there I did a bit of cleaning and normalization and whatnot. For simplicity in this blog post, I’ll only work with the allophones ZEAL and GUILT (which are /il/ and /ɪl/, respectively). To make it possible to run the GAMM, I’ll then convert charactater vectors to factors, and I’ll add the start_event
column. For more information on prepping for GAMMs, see Sóskuthy (2017).
Fit the GAMMs
Now it’s time to actually fit the GAMMs. This one is, relatively speaking, not too complicated. First, the dependent variable is the normalized F1 measurements. As predictors, I have allophone
as a fixed effect which allows the model to have each allophone’s predicted measurements be at different heights (in normalized Hz). I then have percent
as a smooth by allophone
, which allows the model to fit a curve from the onset to the offset of the vowel, independently for each allophone. I also have a smooth for birth year (yob
) by allophone
, which allows the model to fit a curvey from the earliest birth year to the latest birth year, independently for each allophone. The next line (with the ti()
function), is the juicy one: it’s the interaction between percent
and birth year (yob
), which allows the curved line fit from the onset to the offset of the vowel to vary freely as it progresses across birth years. The last two rows include speaker and word as random smooths, by allophone.
I’ll fit the exact same model twice: once for F1 and again for F2. To avoid repetitive code and potential human error, I’ll use update
to change the dependent variable of the model.
<- bam(F1_norm ~ allophone +
f1_gamm s(percent, by = allophone) +
s(yob, by = allophone) +
ti(percent, yob, by = allophone) +
s(speaker, percent, by = allophone, bs = "fs", m=1) + # I removed xt = "cr",
s(word, percent, by = allophone, bs = "fs", m=1),
data = heber,
discrete = TRUE)
<- update(f1_gamm, F2_norm ~ .) f2_gamm
To avoid autocorrelation in the data, I’ll get the rho values from each model and rerun them with this rho parameter. See Sóskuthy (2017) for more information.
<- start_value_rho(f1_gamm)
f1_gamm_rho <- start_value_rho(f2_gamm)
f2_gamm_rho
<- update(f1_gamm, rho = f1_gamm_rho, AR.start = heber$start_event)
f1_gamm <- update(f2_gamm, rho = f2_gamm_rho, AR.start = heber$start_event) f2_gamm
Okay, now that we’ve got some GAMMs fit to the data, we can extract predicted measurements. Here’s where where use the itsadug
package. I’ll get predicted measurements for both ZEAL and GUILT, for every year from 1920 to 2000, and—here’s the kicker—at a thousand equidistant points within the vowel. Actually, 1001 to avoid fencepost errors reasons, and after chopping off the first and last 10% of the vowel to avoid consonantal effects. I’ve found that kind of fidelity is what makes for some really smooth-looking plots. It also makes so that it takes a several minutes to extract predicted measurements.
<- get_predictions(f1_gamm,
f1_preds cond = list(
allophone = c("ZEAL", "GUILT"),
yob = 1920:2000,
percent = seq(0.1, 0.9, length = 1001)),
rm.ranef = TRUE,
print.summary = FALSE)
<- get_predictions(f2_gamm,
f2_preds cond = list(
allophone = c("ZEAL", "GUILT"),
yob = 1920:2000,
percent = seq(0.1, 0.9, length = 1001)),
rm.ranef = TRUE,
print.summary = FALSE)
At this point, we’re ready to combine the two datasets and do a little bit of light processing.
<- bind_rows(`F1` = f1_preds, `F2` = f2_preds, .id = "formant") %>%
preds select(-rm.ranef, -speaker, -word) %>%
rename(hz_norm = fit)
head(preds)
formant allophone percent yob hz_norm CI
1 F1 ZEAL 0.1000 1920 0.4013465 0.05088054
2 F1 GUILT 0.1000 1920 0.4853472 0.05443935
3 F1 ZEAL 0.1008 1920 0.4012991 0.05085728
4 F1 GUILT 0.1008 1920 0.4856286 0.05441697
5 F1 ZEAL 0.1016 1920 0.4012518 0.05083414
6 F1 GUILT 0.1016 1920 0.4859101 0.05439473
We now have 324,324 rows. The reason it’s so huge is because there are 1001 predicted formant measurements, for each year from 1920 to 2000, for each formant, for each allophone. That’s a lot of data.
Basic plots
Okay, now we’re ready for some visuals. As a sanity check, let’s plot this data in the F1-F2 space. That means I’ll quickly have to reshape the data so that F1 and F2 are in separate columns. I’ll use the predicted measurements from 1920 for this plot.
pivot_wider
on vowel data.%>%
preds pivot_wider(names_from = formant, values_from = c(hz_norm, CI)) %>%
filter(yob == 1920) %>%
ggplot(aes(hz_norm_F2, hz_norm_F1, color = allophone)) +
geom_path(arrow = joeyr::joey_arrow()) +
scale_x_reverse() +
scale_y_reverse() +
theme_bw()
Okay, looks decent enough. Let’s plot that same data but in a way that imitates a spectrogram. I’ll facet it by formant so that F1 and F2 have equal space in the plot.
%>%
preds filter(yob == 1920) %>%
ggplot(aes(percent, hz_norm, color = allophone)) +
geom_path() +
facet_wrap(~formant, ncol = 1, scales = "free") +
theme_bw()
Okay, now that we’ve got the basic gist of the plot, we’re ready to make it a nicer!
Sprucing up the plots
If you’ve seen my vowel plots, you know that I like to make things a little bit fancier than what basic ggplot2 does for you. Some things are driven by principles of data visualization, some are aesthetics, and some are just for personal taste.
- I’ll reverse the order of F1 and F2 so that F2 is on top, better imitating a spectrogram. I do that using
fct_rev
withinmutate
as a pre-processing stage, before the actualggplot
call. - I’ll change the colors so that they’re a little nicer and colorblind friendly using the
scale_color_ptol
function in theggthemes
package. This uses Paul Tol’s color schemes, which you can read more about here. - I’ll also add a title and make the y-axis label a little less code-y with
labs
.
- I’ll change the x-axis ticks so that they line up with values that are more meaningful (what is halfway between 0.5 and 0.25?) with
scale_y_continuous
.
%>%
preds filter(yob == 1920) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, color = allophone)) +
geom_path() +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Already, that version of the plot looks a lot crisper than the earlier one.
Adding confidence intervals
From this point on, I’ll do a little more of a step-by-step walk-through of the plot so that you can see what each part does and why I did it that way.
Once crucial element for our plot is that we want to see the confidence intervals. We’ll do that with geom_ribbon
and use the CI
column in our data to help determine the width.
%>%
preds filter(yob == 1920) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, color = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI)) +
geom_path() +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Okay, so that’s not pretty. Let’s make it lighter. I’ll set the color of these bands intervals to be "gray75"
which is kinda like a shade of gray three-quarters of the way from black to white. I always forget this when I make plots, but this is done with the fill
argument rather than the color
argument.
%>%
preds filter(yob == 1920) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, color = allophone, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75") +
geom_path() +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
OKay that almost works, but when the two overlap the red on is on top. So, I’ll set the alpha level (= transparency) to be 0.5 so that we can see overlapping areas a little better.
%>%
preds filter(yob == 1920) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, color = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path() +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
There we go, now we can see the bands. The problem I have now is that I don’t really want the edges of the ribbons to be colored—I’d rather they just be gray. If you look at the code, the color = allophone
argument is in the main ggplot
function, which means it’ll apply to all layers in the plot where relevant (here, geom_ribbon
and geom_path
). Instead, I only want it to apply to one layer, geom_path
. So I’ll just move it to that function instead.
%>%
preds filter(yob == 1920) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Well great, what happened here? The ribbons are now really faded or something. As it turns out, when geom_ribbon
was inheriting the color = allophone
argument, that was what kept the two ribbons distinct from each other. Now, there’s nothing in the geom_ribbon
function or its inherited aesthetics from ggplot
that is distinguishing ZEAL from GUILT. I don’t want to put color back in. So instead, I’ll put group = allophone
. That’ll tell it to group things by allophone. I’ll go ahead and put that in the main ggplot
call because it won’t affect geom_path
.
%>%
preds filter(yob == 1920) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Okay, so now we’ve got some decent-looking ribbons.
Indicate significance
In my LSA presentation I had several things in the plot to help indicate where the confidence intervals overlapped. Let’s take a look at that real quick.
So, there are several things going on here.
- Where the confidence intervals overlap, the color of the line turns to a darker gray.
- If you look closely, the size of the gray line is thinner than the colored lines.
- A dotted vertical black line indicates the crossing point.
So, what we need to do is figure out not only where along the duration of the vowel the confidence intervals overlap, but we also need to figure out that crossing point. The first task is pretty straightforward. The second is a bit more complicated. The third is surprisingly tricky. Let’s start with the first.
Finding where confidence intervals overlap
To begin, let’s look at the structure of our data as is so we know what we’re working with.
%>%
preds filter(yob == 1920) %>%
head()
formant allophone percent yob hz_norm CI
1 F1 ZEAL 0.1000 1920 0.4013465 0.05088054
2 F1 GUILT 0.1000 1920 0.4853472 0.05443935
3 F1 ZEAL 0.1008 1920 0.4012991 0.05085728
4 F1 GUILT 0.1008 1920 0.4856286 0.05441697
5 F1 ZEAL 0.1016 1920 0.4012518 0.05083414
6 F1 GUILT 0.1016 1920 0.4859101 0.05439473
We’ve got F1 and F2 predictions in the same column (hz_norm
). Since we need F1 and F2 to talk to each other to see if they overlap, the first step that we’ll need to do is to reshape the data so that F1 and F2 are in separate columns. This puts the predicted F1 and F2 measurements from any vowel token from any percent on the same row. We’ll do that the same way that we did it previously. We really want the predicted formant values and the confidence intervals so be sure to put both hz_norm
and CI
in the values_from
argument of pivot_wider
.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
print()
# A tibble: 2,002 × 7
formant percent yob hz_norm_ZEAL hz_norm_GUILT CI_ZEAL CI_GUILT
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 F1 0.1 1920 0.401 0.485 0.0509 0.0544
2 F1 0.101 1920 0.401 0.486 0.0509 0.0544
3 F1 0.102 1920 0.401 0.486 0.0508 0.0544
4 F1 0.102 1920 0.401 0.486 0.0508 0.0544
5 F1 0.103 1920 0.401 0.486 0.0508 0.0544
6 F1 0.104 1920 0.401 0.487 0.0508 0.0543
7 F1 0.105 1920 0.401 0.487 0.0507 0.0543
8 F1 0.106 1920 0.401 0.487 0.0507 0.0543
9 F1 0.106 1920 0.401 0.488 0.0507 0.0543
10 F1 0.107 1920 0.401 0.488 0.0507 0.0542
# ℹ 1,992 more rows
At this point, it’s pretty straightforward math: if the predicted line for ZEAL plus its confidence interval is lower than the predicted line for GUILT minus its confidence interval, than the difference is significant.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
mutate(is_significant = hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT) %>%
print()
# A tibble: 2,002 × 8
formant percent yob hz_norm_ZEAL hz_norm_GUILT CI_ZEAL CI_GUILT
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 F1 0.1 1920 0.401 0.485 0.0509 0.0544
2 F1 0.101 1920 0.401 0.486 0.0509 0.0544
3 F1 0.102 1920 0.401 0.486 0.0508 0.0544
4 F1 0.102 1920 0.401 0.486 0.0508 0.0544
5 F1 0.103 1920 0.401 0.486 0.0508 0.0544
6 F1 0.104 1920 0.401 0.487 0.0508 0.0543
7 F1 0.105 1920 0.401 0.487 0.0507 0.0543
8 F1 0.106 1920 0.401 0.487 0.0507 0.0543
9 F1 0.106 1920 0.401 0.488 0.0507 0.0543
10 F1 0.107 1920 0.401 0.488 0.0507 0.0542
# ℹ 1,992 more rows
# ℹ 1 more variable: is_significant <lgl>
At this point, we can now just revert the shape, and we’re ready to plot. Reverting the shape with pivot_longer
isn’t trivial, because it uses some regex and a special trick with ".value"
that lets you split the resulting column into two. I encourage you to look at a previous blog post of mine about reshaping data.
pivot_wider
on vowel data.I’ll incorporate this new information by changing the color of the lines to map to the is_significant
column. I’ll also add the size aesthetic and map it to is_significant
too. Finally, I’ll control the size by adding scale_size_manual
. I’ll purposely exaggerate the difference so we can make sure they’re doing what we want.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
mutate(is_significant = hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT) %>%
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = is_significant, size = is_significant)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(1, 3)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Well shoot. What’s wrong with this plot? I see two things. First, while the significance was accurately detected for F1, it was not for F2 for some reason. The second problem is that we’ve got color for significance rather than by allophone. Let’s fix both of these, starting with the first.
A more robust method
The code we did for detecting significance was this. Keep in mind that this is just one chunk of our pipeline that I’ve separated for illustrative purposes and it’s not going to run by itself.
# Calculate the difference
mutate(is_significant = hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT) %>%
The problem is that this works only if ZEAL is lower than GUILT. Based on the plots, it’s clear that it is, but only for F1. How can we make the calculation more general so that it detects overlaps without us predetermining which one is supposed to be higher?
There are probably a lot of solutions out there, but here’s the solution I came up with. Given each formant, we need to programatically determine which allophone is higher. We could do it as something like this:
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT)
This would create a new column called ZEAL_is_higher
. If ZEAL is indeed higher, the value in that column for that row is TRUE
. Otherwise, it’s FALSE
. We can now use this information to calculate the confidence interval a little more robustly. I’ve chosen to use case_when
instead of if_else
: I find the extra typing worth it for the clarity that it offers.
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
So what this does is it first determines if ZEAL is higher than GUILT. If it is, then it’ll see if the confidence interval of ZEAL is higher than the confidence interval of GUILT. If it’s lower, it’ll see if it’s lower.
For this to work, we’ll need to do this independently for each formant. And for good measure, we should do it independently for each birth year and each percent as well.
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
If we incorporate this into our pipeline, we should have made an improvement to our plot.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = is_significant, size = is_significant)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(1, 3)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Success! The plot is still bad, but it at least has correctly detected regions of significance for both F1 ad F2. Now we need to figure out how to fix the colors.
Add gray for non-significance
The core of the problem is that we are actually mapping two pieces of information to a single aesthetic: allophone and significance. This is something that is normally not done in ggplot2. So, we’re going to go about it in a bit of a hacky way. Basically, what we need to do is expand our case_when
function above to three levels: “ZEAL”, “GUILT”, and “not significant.” Fortunately, this is pretty straightforward to do. Here’s what the code looks like:
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant")))
What it does is it creates a new column, allophone_plus_sig
. For each row in the spreadsheet, if it’s not significant, or rather if is_significant
is FALSE
, then the value in the new allophone_plus_sig
column is set to “not significant”. Otherwise, it’ll just take the value from the allophone
column, which, in our case, is "ZEAL"
or "GUILT"
. The reason why allophone
is wrapped in as.character()
is because case_when
requires all output values to be of the same type, and since "not significant"
is a string, specifically a character vector, I have to turn allophone
, a factor, into a character vector. Don’t worry too much about it. But, I guess what I need to do is then turn that three-way allophone_plus_sig
back into a factor, which allows me to control the ordering.
Okay, so now we’re ready to incorporate that back into the pipeline. Because we’re using information about the allophone
column, it has to happen after our retransformation back into the original format. Here’s the completed pipeline, and plot.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(1, 3)) +
scale_color_ptol() +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Hooray! It looks like it works. Now what we’ll do is set the colors ourselves. Because I’m adding gray to the mix, I can’t use scale_color_ptol
anymore and I’ll have to use scale_color_discrete
and tell it the colors I want manually. I’ll use ptol_pal()
to access those same blue and red colors though and then I’ll add gray as a third option.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant)) %>%
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Hey, we’re looking pretty good! The plot is really starting to come together!
Finding transition points
The last thing we need to do for indicating significance is finding the transition points and adding that black vertical dotted line. For some reason this is a harder task than I feel like it should be. It’s easy to spot visually, but not quite as easy for the computer to do it. Maybe I’m doing it wrong, but here’s how I do it.
As a reminder, here’s what our current dataset looks like just before it gets plotted.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant)) %>%
print()
# A tibble: 4,004 × 9
# Groups: formant, yob, percent [2,002]
formant percent yob ZEAL_is_higher is_significant allophone hz_norm CI
<fct> <dbl> <int> <lgl> <lgl> <chr> <dbl> <dbl>
1 F1 0.1 1920 FALSE FALSE ZEAL 0.401 0.0509
2 F1 0.1 1920 FALSE FALSE GUILT 0.485 0.0544
3 F1 0.101 1920 FALSE FALSE ZEAL 0.401 0.0509
4 F1 0.101 1920 FALSE FALSE GUILT 0.486 0.0544
5 F1 0.102 1920 FALSE FALSE ZEAL 0.401 0.0508
6 F1 0.102 1920 FALSE FALSE GUILT 0.486 0.0544
7 F1 0.102 1920 FALSE FALSE ZEAL 0.401 0.0508
8 F1 0.102 1920 FALSE FALSE GUILT 0.486 0.0544
9 F1 0.103 1920 FALSE FALSE ZEAL 0.401 0.0508
10 F1 0.103 1920 FALSE FALSE GUILT 0.486 0.0544
# ℹ 3,994 more rows
# ℹ 1 more variable: allophone_plus_sig <fct>
The key is that is_significant
column. At some point, all those FALSE
values will switch to TRUE
. We need to add some new column that indicates where that switch happens. The problem is that tidyverse functions—which this workflow is heavily entrenched in—generally run one row at a time. It’s difficult get some function to interact with some other row without flat out writing a for
loop. I’d rather keep all my data processing in a single pipeline if possible, so we’ll have to think of some way to do this.
The secret, I’ve found is with the lag
or lead
functions. To illustrate what they do, let’s take a single, arbitrary list of colors:
<- tibble(colors = c("red", "yellow", "blue", "green"))
colors colors
# A tibble: 4 × 1
colors
<chr>
1 red
2 yellow
3 blue
4 green
What lag
does is it takes that list, shifts all the elements of that list down by one, and puts an NA at the beginning. Basically, it returns what the next element of the list is. lead
does the opposite and shifts all the elements up by one slot. For more information on lead
and lag
, see the documentation.
%>%
colors mutate(prev_color = lag(colors),
next_color = lead(colors))
# A tibble: 4 × 3
colors prev_color next_color
<chr> <chr> <chr>
1 red <NA> yellow
2 yellow red blue
3 blue yellow green
4 green blue <NA>
So, this can be very useful for us. We want to look at that is_significant
column and find rows where the next row is not the same as the current row. So we want to find places where lead(is_significant)
is not the same as is_significant
.
So, the key piece of code we need is this:
mutate(switch = is_significant != lead(is_significant))
Now, we have to be careful about how this is done. Because we have data from lots of years and both formants all pooled together, we want to make sure that they don’t mess each other up. So, we’ll add a group_by
function to make sure that the code runs independently for each formant and each year of birth. Otherwise, we might have cases where the offset of one year is compared against the onset of another year, and that’s not what we want. For good measure I’ll also make sure the data is sorted by percent
just in case things got out of order at some point.
So the code we need to insert now is this block:
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup()
I’ll go ahead and insert that right after we find the significance. Here’s the full data processing pipeline now:
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Find where the switches happen
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup() %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant)) %>%
print()
# A tibble: 4,004 × 10
formant percent yob ZEAL_is_higher is_significant switch allophone hz_norm
<fct> <dbl> <int> <lgl> <lgl> <lgl> <chr> <dbl>
1 F1 0.1 1920 FALSE FALSE FALSE ZEAL 0.401
2 F1 0.1 1920 FALSE FALSE FALSE GUILT 0.485
3 F1 0.101 1920 FALSE FALSE FALSE ZEAL 0.401
4 F1 0.101 1920 FALSE FALSE FALSE GUILT 0.486
5 F1 0.102 1920 FALSE FALSE FALSE ZEAL 0.401
6 F1 0.102 1920 FALSE FALSE FALSE GUILT 0.486
7 F1 0.102 1920 FALSE FALSE FALSE ZEAL 0.401
8 F1 0.102 1920 FALSE FALSE FALSE GUILT 0.486
9 F1 0.103 1920 FALSE FALSE FALSE ZEAL 0.401
10 F1 0.103 1920 FALSE FALSE FALSE GUILT 0.486
# ℹ 3,994 more rows
# ℹ 2 more variables: CI <dbl>, allophone_plus_sig <fct>
We now have a new switch
column that is true only if the next timepoint’s significance is different from the current timepoint.
We can now incorporate that into the plot. I’ll add the vertical line with geom_vline
.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Find where the switches happen
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup() %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant)) %>%
# Plot
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw()
Okay! We’re looking pretty good now. We’ve got a nice looking plot.
Labels on the plots
But we’re not done! Not even with the static plot—and we haven’t even gotten to the animations yet! As a reminder, here’s a plot I used in my LSA talk:
The only difference now is that we have a legend still. That legend has been bothering me this whole time and we need to get rid of it. The size is something that probably doesn’t need to be in the legend, and we can do a better job at indicating which color is which vowel. I try to avoid legends if at all possible and instead opt to annotate the plot directly.
As a first step, we can get rid of the legend by adding theme(legend.position = "none")
at the end of our plotting code.
%>%
preds filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Find where the switches happen
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup() %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant)) %>%
# Plot
ggplot(aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
That at least takes care of removing the legend. But now we need to figure out how to annotate the plot.
A simple solution is to manually put some labels in. That works well if there’s only one plot you need to do. But since you may want to plot many vowel pairs, it gets cumbersome to have to do that every time. Plus, we’ll be animating these, which means the lines could move around and we don’t want to have to reposition the label every time. A better solution would be to programatically generate plot labels.
So what we’ll need to do is take our existing data and generate a new dataset that contains the information for where we want to put the labels. So let’s save the plotting data without actually plotting it.
<- preds %>%
plotting_data filter(yob == 1920) %>%
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Find where the switches happen
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup() %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant))
I’ve found that a good position for the label is to put it near the onset of the vowel, slightly outside of the confidence intervals. So, let’s get those vowel onsets.
<- plotting_data %>%
label_data filter(percent == min(percent)) %>%
print()
# A tibble: 4 × 10
formant percent yob ZEAL_is_higher is_significant switch allophone hz_norm
<fct> <dbl> <int> <lgl> <lgl> <lgl> <chr> <dbl>
1 F1 0.1 1920 FALSE FALSE FALSE ZEAL 0.401
2 F1 0.1 1920 FALSE FALSE FALSE GUILT 0.485
3 F2 0.1 1920 TRUE FALSE FALSE ZEAL 2.08
4 F2 0.1 1920 TRUE FALSE FALSE GUILT 1.81
# ℹ 2 more variables: CI <dbl>, allophone_plus_sig <fct>
Now, we could just put both labels above the line. Here I’ll take the confidence interval times 1.5 and use that as the vertical position. That way it’s above the shaded area.
ggplot(plotting_data, aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = hz_norm + 1.5*CI)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
But that doesn’t look too great and are in fact misleading. A better solution would be to put the upper one above the line and the lower one below it. Fortunately, we already know which line is higher because we got the ZEAL_is_higher
already figured out. So, with another case_when
statement, we can get a higher position for the top one and a lower position for the bottom one. Here’s
<- plotting_data %>%
label_data filter(percent == min(percent)) %>%
mutate(label_position = case_when(allophone == "ZEAL" & ZEAL_is_higher ~ hz_norm + 1.5*CI,
== "GUILT" & ZEAL_is_higher ~ hz_norm - 1.5*CI,
allophone == "ZEAL" & !ZEAL_is_higher ~ hz_norm - 1.5*CI,
allophone == "GUILT" & !ZEAL_is_higher ~ hz_norm + 1.5*CI)) %>%
allophone print()
# A tibble: 4 × 11
formant percent yob ZEAL_is_higher is_significant switch allophone hz_norm
<fct> <dbl> <int> <lgl> <lgl> <lgl> <chr> <dbl>
1 F1 0.1 1920 FALSE FALSE FALSE ZEAL 0.401
2 F1 0.1 1920 FALSE FALSE FALSE GUILT 0.485
3 F2 0.1 1920 TRUE FALSE FALSE ZEAL 2.08
4 F2 0.1 1920 TRUE FALSE FALSE GUILT 1.81
# ℹ 3 more variables: CI <dbl>, allophone_plus_sig <fct>, label_position <dbl>
Now, there are more efficient ways of coding that, but I like to be explicit, so I’ve opted for the slightly more verbose method.
ggplot(plotting_data, aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
There we go. That’s a little bit better. For good measure, I’m going to also color the text using the same color assigned to the allophones. My first time doing this, I made the mistake of using the allophone_plus_sig
column just like the lines. The problem is now I had gray labels, which is not what I wanted.
ggplot(plotting_data, aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone_plus_sig)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
Instead, I should color it using just plain allophone
. That way ZEAL is always blue and GUILT is always red, regardless of the significance of the vowel at the onset.
ggplot(plotting_data, aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
Okay, now that we’ve got a decent looking plot (at least least it matches the plot I used in my presentation), we can start to work on animating it!
Animate!
The gganimate
package is awesome and with just a few lines of code, we can make an animation. The simplest way to think about the animations is that whatever variable you would use in a facet_wrap
, you use that as the frames in the animation. In our plot, we are already faceting by formant (F1 and F2), but we want the animation to happen by birth year. Currently we’re only looking at one birth year (1920). In our animation, we would take out that one-year filter and animate along that variable instead.
First, what I’m going to do is regenerate my data so that all birth years are in there.
<- preds %>%
animating_data
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Find where the switches happen
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup() %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant))
<- animating_data %>%
label_data filter(percent == min(percent)) %>%
mutate(label_position = case_when(allophone == "ZEAL" & ZEAL_is_higher ~ hz_norm + 1.5*CI,
== "GUILT" & ZEAL_is_higher ~ hz_norm - 1.5*CI,
allophone == "ZEAL" & !ZEAL_is_higher ~ hz_norm - 1.5*CI,
allophone == "GUILT" & !ZEAL_is_higher ~ hz_norm + 1.5*CI)) allophone
Now, I could make a static plot that includes all the data at once.
ggplot(animating_data, aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
A mess. But that’s fine. We could also make a version of the plot where there’s one panel per birth year.
ggplot(animating_data, aes(percent, hz_norm, group = allophone)) +
geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(yob~formant, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none")
So here’s all the information at once. What we need to do now is animate across those birth years. It literally takes one line of code to do this: transition_time(yob)
. I’ll save the plot into an object called a
. I’ll then take that a
object and animate it with animate
. I’ll specificy the height, width, and resolution. This line of code may take a minute or so. Finally, I’ll export the plot with anim_save
, which lets me specify the path and the frames per second.
<- ggplot(animating_data, aes(percent, hz_norm, group = allophone)) +
a geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE), aes(xintercept = percent), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = "Predicted trajectories for ZEAL and GUILT: birth year = 1920",
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none") +
transition_time(yob)
animate(a, height = 7.5, width = 13.33, unit = "in", res = 150)
anim_save(filename = "./plots/lsa2022/lsa2022_temp.gif", fps = 30)
Okay, this is looking pretty good, but there are a few details we’ll want to change. One is that we need to change the title now because it’s not just 1920. Fortunately gganimate
has a way to incorporate the year into the title itself. So I’ll chage the title we had in labs
so that it incorporates {frame_time}
.
Another smaller detail is the vertical black line. You’ll notice that it sort of moves quickly across the screen from left to right even though there’s no corresponding overlap in the confidence intervals. If you look at each plot individually, that line isn’t there. So what’s up with that? What’s actually happening in the data is the black line from the earlier in the gif disappears and then appears later on. So gganimate
fills in the gaps because it assumes all the black lines are the same underlying element, so it’ll add transitions between them.
With a little of trial-and-error on my part, I found that that the problem is the group
argument in aes
. Basically, what it’s doing it assuming that the vertical line “object” should be something that persists across the entire time frame represented here. In reality, we don’t need the thing to persist. There’s no reason why it should be there if it doesn’t have to. Anyway, long story short, to fix it, I’ll add group = yob
in geom_vline
. That way it’ll treat the vertical from each year as a distinct unit without attempting go connect the dots across time.
<- ggplot(animating_data, aes(percent, hz_norm, group = allophone)) +
a geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE),
aes(xintercept = percent, group = yob), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = paste("Difference between ZEAL and GUILT (birth year: {frame_time})"),
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none") +
transition_time(yob)
animate(a, height = 7.5, width = 13.33, unit = "in", res = 150)
anim_save(filename = paste0("./plots/lsa2022/lsa2022_temp.gif"), fps = 10)
At this point, if you have a keen eye, you’ll notice that the black vertical line still kinda hangs out there on the left side for a while. I’ll admit I hadn’t seen that before my LSA presentation so it’s there in the “official” plot too. I don’t quite know how to fix that.
Pulling it all together
So, we now have our visual. Here is the code from start to finish. First, we got the data we wanted to visualize:
<- preds %>%
animating_data
# Reshape
pivot_wider(names_from = allophone, values_from = c(hz_norm, CI)) %>%
# Calculate the difference
group_by(formant, yob, percent) %>%
mutate(ZEAL_is_higher = hz_norm_ZEAL > hz_norm_GUILT,
is_significant = case_when(ZEAL_is_higher ~ hz_norm_ZEAL - CI_ZEAL > hz_norm_GUILT + CI_GUILT,
!ZEAL_is_higher ~ hz_norm_ZEAL + CI_ZEAL < hz_norm_GUILT - CI_GUILT)) %>%
# Find where the switches happen
group_by(formant, yob) %>%
arrange(formant, yob, percent) %>%
mutate(switch = is_significant != lead(is_significant)) %>%
ungroup() %>%
# Revert the shape
pivot_longer(cols = c(matches("hz_norm"), matches("CI")),
names_to = c(".value", "allophone"),
names_pattern = "(.+)_([A-Z]+)\\Z") %>%
# Add three-way significance + allophone
mutate(allophone_plus_sig = case_when(!is_significant ~ "not significant",
TRUE ~ as.character(allophone)),
allophone_plus_sig = factor(allophone_plus_sig, levels = c("ZEAL", "GUILT", "not significant"))) %>%
mutate(formant = fct_rev(formant))
Then we got a subset of that for the labels, and did some slight processing on that.
<- animating_data %>%
label_data filter(percent == min(percent)) %>%
mutate(label_position = case_when(allophone == "ZEAL" & ZEAL_is_higher ~ hz_norm + 1.5*CI,
== "GUILT" & ZEAL_is_higher ~ hz_norm - 1.5*CI,
allophone == "ZEAL" & !ZEAL_is_higher ~ hz_norm - 1.5*CI,
allophone == "GUILT" & !ZEAL_is_higher ~ hz_norm + 1.5*CI)) allophone
Finally, we plotted it.
<- ggplot(animating_data, aes(percent, hz_norm, group = allophone)) +
a geom_ribbon(aes(ymin = hz_norm - CI, ymax = hz_norm + CI), fill = "grey75", alpha = 0.5) +
geom_path(aes(color = allophone_plus_sig, size = is_significant)) +
geom_vline(data = . %>% filter(switch == TRUE),
aes(xintercept = percent, group = yob), linetype = "dotted") +
geom_text(data = label_data, aes(label = allophone, y = label_position, color = allophone)) +
scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
scale_size_manual(values = c(0.5, 1)) +
scale_color_manual(values = c(ptol_pal()(2), "gray50")) +
facet_wrap(~formant, ncol = 1, scales = "free") +
labs(title = paste("Difference between ZEAL and GUILT (birth year: {frame_time})"),
y = "frequency (normalized)") +
theme_bw() +
theme(legend.position = "none") +
transition_time(yob)
animate(a, height = 7.5, width = 13.33, unit = "in", res = 150)
anim_save(filename = paste0("./plots/lsa2022/lsa2022_temp.gif"), fps = 10)
Now if you want to get fancy, you could wrap all this up into a function so that it can be applied to whatever vowel pair you want to visualize in your data. That’s what I ended up having to do, which is how I got my four identically-formatted gifs in my presentation. I really like wrapping data viz code into functions because it lets me make lots of plots at once without copying and pasting. Anyway, I’ll let you work that out on your own. 99% of the code is the same and the only differences places where there are explicit mention of the vowel classes. You also of course have to prep the data as well.
And that’s it! Okay, now I expect more people to have animated vowel plots at the next LSA!