# Vowel overlap in R: More advanced topics

How-to Guides
Methods
Phonetics
R
Skills
Data Viz
Vowel Overlap
Author

Joey Stanley

Published

February 7, 2019

``options(dplyr.summarise.inform = FALSE)``

This is a continuation of my previous tutorial on how to calculate Pillai scores and Bhattacharyya’s Affinity in R for the purposes of measuring vowel overlap. It occurred to me as I was putting the previous one together though that I had a lot of things to say and the tutorial got really long and complicated. So I moved all the more advanced topics to this one to keep the main one a little lighter and more approachable.

In this post, I’ll cover some topics like what to do if you have multiple vowel pairs you want to measure in each speaker, errors you may encounter with the `select` function when you’re calculating Bhattacharyya’s Affinity, ways at making the functions less error-prone, and some visualizations you can do with your data after you’ve collected it.

Note

Update (November 23, 2021): Betsy Sneller and I have done some recent research on Pillai scores. Please see the summary of our ASA2021 poster (and the paper itself) for more information.

## Data prep

The data prep for this post is covered in Part 1 already, so I’ll put the code here without further comment.

``library(tidyverse)``
``````my_vowels <- read.csv("../../data/joey.csv") %>%
filter(stress == 1) %>%
select(vowel, word, dur, F1.50., F2.50., fol_seg, plt_manner, plt_place, plt_voice) %>%
rename(F1 = F1.50., F2 = F2.50.) %>%
mutate(fake_speaker = sample(c("Joey", "Stanley"), nrow(.), replace = TRUE))``````
``````low_back <- my_vowels %>%
filter(vowel %in% c("AA", "AO"),
!fol_seg %in% c("L", "R"),
word != "ON")``````

So where I left off at the last tutorial was a function that can calculate the Pillai score and another for the Bhattacharyya’s Affinity, each with the help of `summarize`:

``````pillai <- function(...) {
summary(manova(...))\$stats["vowel","Pillai"]
}
bhatt <- function (F1, F2, vowel) {
vowel_data <- droplevels(data.frame(vowel))

sp_df <- sp::SpatialPointsDataFrame(cbind(F1, F2), vowel_data)
}
low_back %>%
summarize(low_back_pillai = pillai(cbind(F1, F2) ~ vowel),
low_back_bhatt = bhatt(F1, F2, vowel))``````
``````  low_back_pillai low_back_bhatt
1       0.2250877      0.8296413``````

Okay cool. Let’s see what else can be done with this.

## Multiple vowel pairs

What I’ve shown so far is how to calculate the Pillai score for multiple speakers for a single pair of vowels. The next question is what how to get the Pillai score for multiple speakers for multiple pairs of vowels.

Unfortunately, I don’t really know of a quick way to do that. The problem is a single vowel might be used in multiple pairs (like measuring trap-lot overlap or trap-fleece overlap). Plus, you might be only interested in certain vowels in certain phonetic environments, like before nasals or before tautosyllabic /l/, so defining all that on the fly would be tricky. So there’s no good way that I know of to loop through or group everything in a straightforward manner like we could with speaker.

The easiest solution I can think of is to define separate datasets like we’ve done with `low_back` and run the same code on them. That way, you have all the flexibility of defining specific allophones for each question. And, you probably won’t be getting the Pillai score on too many pairs of vowels, right?

Let’s make two more subsets. First, I’ll look at the pin-pen merger which I don’t really have, but is common in the South where I live right now. (I’ll also remove some stop words from the dataset.) I’ll also look at the fail-fell merger, which I also don’t really have, but can be found in places like Utah and parts of Texas. I’ll create these subsets and plot them so you can see what they look like in my own speech. Maybe you can roughly estimate the Pillai score.

``````pin_pen <- my_vowels %>%
filter(vowel %in% c("IH", "EH"),
fol_seg %in% c("M", "N"),
!word %in% c("IN", "INTO"))
``````  vowel        word  dur    F1     F2 fol_seg plt_manner plt_place plt_voice
1    EH       MEANT 0.06 448.8 1631.9       N      nasal    apical    voiced
2    EH      MENTAL 0.06 570.5 1678.6       N      nasal    apical    voiced
3    EH       ENTER 0.12 482.7 1738.0       N      nasal    apical    voiced
4    IH   BEGINNING 0.05 338.9 1968.6       N      nasal    apical    voiced
5    EH TEMPERATURE 0.06 495.6 1674.6       M      nasal    labial    voiced
6    EH TEMPERATURE 0.05 477.2 1612.0       M      nasal    labial    voiced
fake_speaker
1      Stanley
2         Joey
3         Joey
4         Joey
5         Joey
6      Stanley``````
``````ggplot(pin_pen, aes(F2, F1, color = vowel, label = word)) +
geom_text(size = 3) +
scale_x_reverse() + scale_y_reverse()``````

``````fail_fell <- my_vowels %>%
filter(vowel %in% c("EY", "EH"), fol_seg =="L")
``````  vowel         word  dur    F1     F2 fol_seg plt_manner plt_place plt_voice
1    EY     AILMENTS 0.06 387.3 2149.5       L    lateral    apical    voiced
2    EY       SAILOR 0.08 382.5 1898.5       L    lateral    apical    voiced
3    EH    CELLPHONE 0.09 509.4 1549.1       L    lateral    apical    voiced
4    EY SURVEILLANCE 0.09 468.9 2104.8       L    lateral    apical    voiced
5    EY       TAYLOR 0.05 311.7 2069.7       L    lateral    apical    voiced
6    EH         ELSE 0.06 583.1 1343.6       L    lateral    apical    voiced
fake_speaker
1         Joey
2      Stanley
3         Joey
4      Stanley
5      Stanley
6         Joey``````
``````ggplot(fail_fell, aes(F2, F1, color = vowel, label = word)) +
geom_text(size = 3) +
scale_x_reverse() + scale_y_reverse()``````

Okay, so now we’re ready to calculate Pillai scores to those new pairs. Because we’ve created the `pillai` function, we can relatively easily apply it to each of these datasets.

``````pin_pen %>%
summarize(pin_pen_pillai = pillai(cbind(F1, F2) ~ vowel))``````
``````  pin_pen_pillai
1      0.6964555``````
``````fail_fell %>%
summarize(fail_fell_pillai = pillai(cbind(F1, F2) ~ vowel))``````
``````  fail_fell_pillai
1        0.8189749``````

So this is pretty cool. Here you can see that my fail-fell vowel classes are quite distinct with a Pillai score of 0.82. My pin-pen vowels are also quite distinct with a Pillai score of 0.70.

So we’ve run the same thing three times on three different datasets. But can we do this even more elegantly? Sure!

Okay, so first, I’ll combine all three datasets (`low_back`, `pin_pen` and `fail_fell`) into one combined dataframe using `bind_rows`. But in order to “keep track” of which dataframe a particular row came from, I’ll create a new column called `vowel_pair` using the `.id = "vowel_pair"` argument. To get this to work then, I’ll “name” each of the dataframes as I combine it. So for example, the `low_back` dataframe is called `"low back"` and I show that by putting the new name between ticks (that little thing next to the number 1 key on my keyboard that looks like this: `). If you combine the dataframes in this way, you’ll have a new column saying which one it came from.

``````all_pairs <- bind_rows(`low back` = low_back,
`pin pen`  = pin_pen,
`fail fell` = fail_fell,
.id = "vowel_pair")
``````  vowel_pair vowel   word  dur    F1     F2 fol_seg plt_manner   plt_place
1   low back    AA   TODD 0.17 614.6 1065.2       D       stop      apical
2   low back    AA    GOD 0.09 554.0 1250.3       D       stop      apical
3   low back    AA FATHER 0.12 598.6 1000.2      DH  fricative interdental
4   low back    AO  WATER 0.05 587.5  986.7       T       stop      apical
5   low back    AO   LONG 0.09 577.6 1008.3      NG      nasal       velar
6   low back    AA STOCKS 0.13 578.1 1073.7       K       stop       velar
plt_voice fake_speaker
1    voiced         Joey
2    voiced      Stanley
3    voiced         Joey
4 voiceless         Joey
5    voiced      Stanley
6 voiceless         Joey``````
``tail(all_pairs)``
``````    vowel_pair vowel     word  dur    F1     F2 fol_seg plt_manner plt_place
172  fail fell    EY     GAIL 0.12 431.1 1908.6       L    lateral    apical
173  fail fell    EY AILMENTS 0.12 438.0 2048.3       L    lateral    apical
174  fail fell    EY     SAIL 0.06 423.7 1797.2       L    lateral    apical
175  fail fell    EH     FELL 0.05 502.3 1539.2       L    lateral    apical
176  fail fell    EY  WAILING 0.12 421.3 2086.7       L    lateral    apical
177  fail fell    EY    STALE 0.14 437.9 1862.4       L    lateral    apical
plt_voice fake_speaker
172    voiced         Joey
173    voiced      Stanley
174    voiced         Joey
175    voiced      Stanley
176    voiced         Joey
177    voiced      Stanley``````

Now that may seem like more work than it’s worth, but the payoff is that when you actually go to do the Pillai scores, it’s hardly any more complicated than before. In the `group_by` function, just put that you want to group the data by the vowel pair and by the speaker and it’ll magically do the rest for you.

``````all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai(cbind(F1, F2) ~ vowel))``````
``````# A tibble: 6 × 3
# Groups:   vowel_pair [3]
vowel_pair fake_speaker pillai
<chr>      <chr>         <dbl>
1 fail fell  Joey          0.825
2 fail fell  Stanley       0.849
3 low back   Joey          0.117
4 low back   Stanley       0.301
5 pin pen    Joey          0.731
6 pin pen    Stanley       0.708``````

In my opinion, it’s worth it to do the extra bit of work beforehand prepping your data with the whole `bind_rows` business and creating the new function because the benefit is that when it actually comes time to calculate the Pillai score, you’ve got a tidy dataset to work with and a flexible function to use.

## The `select` clash and Bhattacharyya’s Affinity

Okay, in Part 1, I talked about the library needed to run Bhattacharyya’s Affinity, `adehabitatHR`. However, it always seemed to be the case that my tidyverse code, particularly the `select` function always broke in scripts that used this package. It was super annoying and for the longest time I couldn’t figure out why.

In Dan Johnson’s NWAV presentation where he introduces Bhattacharyya’s Affinity, he installs and loads an additional package `sp` so that the `SpatialPointsData-Frame` function can be used. This is technically not necessary to do explicitly, because `sp` is a dependency of `adehabitatHR`, meaning when you install and load `adehabitatHR` you also are bringing `sp` along too.

As it turns out, when you load the `adehabitatHR` package, it also loads the `MASS` package. Unfortunately for us, there’s a function in `MASS` called `select`. Well, `dplyr` also has a `select` function. Having two `select` functions at the same time creates a clash. So, R chooses the one that was loaded most recently, which is `MASS` (via `adehabitatHR`).

You can actually see this in a warning when you load `adehabitatHR` if `dplyr` (via `tidyverse`) is already installed:

To replicate the messages and output in this section you’ll have to restart R and load the packages again from scratch in this order. To do that, go to Session→Restart R.
``````library(tidyverse)

## ...output truncated...

## Attaching package: ‘MASS’

## The following object is masked from ‘package:dplyr’:

##     select

## ...output truncated...``````

And when you go to use `select`, R thinks it should run `MASS::select` instead of `dplyr::select`:

``````low_back %>%
select(vowel)

## Error in select(., vowel) : unused argument (vowel)``````

So, there are two solutions to this. First, you can simply take advantage of the fact that the most recently loaded package takes priority, and load `adehabitatHR` before you load `tidyverse`

``````library(adehabitatHR)
library(tidyverse)``````
``````low_back %>%
select(vowel)

##     vowel
## 1      AA
## 2      AA
## 3      AA
## 4      AA
## 5      AO
## 6      AA``````

That’ll ensure that `dplyr::select` takes precedence. However, I’ve never really liked this for a couple reasons. One, I’m stubborn, and I like to load my packages in a specific order (basically the order that I use them) and at the top of my code it seems weird to load a package before `tidyverse`. Also, it means that it I’m already going along in an R session, with `tidyverse` loaded already, if I want to then do some Bhattacharyya’s Affinity, I’ll need to either unload the library or restart R, which is never really all that convenient.

The other option is to not actually load the `adehabitatHR` package at all. You can still use functions when a package is not loaded (as long as you already have them installed to your computer) by prefacing the function name with the name of the package and a pair of colons. So when it comes time to use the one function I need from `adehabitatHR`, the `kerneloverlap` function, I’ll just type `adehabitatHR::kerneloverlap`. In my opinion, this is a better route just because I don’t like loading an entire package (and reworking lots of others things in my code) just to make sure of one function.

For the remainder of this tutorial, I’m going to implement this second option. And for consistency, I’ll also not load the `sp` package either and call `SpatialPointsDataFrame` using `sp::` prefixed before it.

## Making the functions more robust

In this section, I’ll go into some more detail about how to make your functions less likely to crash. Some of the topics here get tedious and cover use some more advanced R skills. If you’ve been finding that your code breaks when you apply it to a bunch of speakers, this might help with these issues.

I’ll start with the `bhatt` function because it’s a little more straightforward. Then we’ll get into some slightly more confusing stuff with `pillai`.

### Making `bhatt` more robust

I’ve noticed when running Bhattacharyya’s affinity on my data that it tends to crash if certain conditions aren’t met. For example, the calculation requires at least five observations from each vowel class to work. So, let’s say I wanted to look at the pull-pole merger (that is, the merger of foot and goat before laterals). We can create the dataset the same way as before

I think I have this merger, but I’m really not sure. I thought studying it in my data would help my own intuitions, but now I’m always hyperaware of the relatively small group of relevant words. But I digress…
``````pull_pole <- my_vowels %>%
filter(vowel %in% c("UH", "OW"), fol_seg %in% c("L")) %>%
droplevels()
``````  vowel       word  dur    F1    F2 fol_seg plt_manner plt_place plt_voice
1    OW       GOAL 0.22 366.7 827.0       L    lateral    apical    voiced
2    OW       POLO 0.19 418.2 893.9       L    lateral    apical    voiced
3    OW      ROLLS 0.11 385.7 785.1       L    lateral    apical    voiced
4    OW      MOULD 0.06 388.1 732.0       L    lateral    apical    voiced
5    OW  PORTFOLIO 0.06 444.1 988.3       L    lateral    apical    voiced
6    OW CONTROLLED 0.07 389.0 849.7       L    lateral    apical    voiced
fake_speaker
1      Stanley
2         Joey
3      Stanley
4         Joey
5      Stanley
6      Stanley``````

But when we run it…

``````pull_pole %>%
summarize(pull_pole_bhatt = bhatt(F1, F2, vowel))

## Error in kernelUD(xy, same4all = TRUE, ...): At least 5 relocations
#  are required to fit an home range``````

It crashes. You can see in the error message that it says you need at least five “relocations” per “home range”. You can tell this package was intended for animal location data instead of vowel data! But we can see that there aren’t five words in the pull (= “UH”) class:

``table(pull_pole\$vowel)``
``````
OW UH
50  1 ``````

There’s just one token! This is problematic because if I’m going to apply this function to like 30 speakers, even if one speaker has insufficient data I want it to work on everyone else without crashing.

So, how can I make the `bhatt` function a little more robust? Expert R users might know of fancier ways to do what I’m about to do, but I just put a little `if` statement in there: if the lowest value in that table is less than five, then `return`—meaning abort the function—with the value `NA`.

``````bhatt <- function (F1, F2, vowel) {
vowel_data <- data.frame(vowel) %>%
droplevels()

if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel

}
pull_pole %>%
summarize(pull_pole_bhatt = bhatt(F1, F2, vowel))``````
``````  pull_pole_bhatt
1              NA``````

Now, when I run it, it won’t crash anymore. It won’t return a number and instead it’ll just return `NA`, but at least it doesn’t crash. You can appreciate now how handy this might be when we combine all the data together and then then the function on all vowel pairs.

``````all_pairs <- bind_rows(`low back` = low_back,
`pin pen`  = pin_pen,
`fail fell` = fail_fell,
`pull pole` = pull_pole,
.id = "vowel_pair")
all_pairs %>%
group_by(vowel_pair) %>%
summarize(bhatt = bhatt(F1, F2, vowel))``````
``````# A tibble: 4 × 2
vowel_pair  bhatt
<chr>       <dbl>
1 fail fell   0.122
2 low back    0.830
3 pin pen     0.393
4 pull pole  NA    ``````

Now, the function will work just fine on the vowel pairs with enough data and will quietly return `NA` if there’s not enough data.

As it turns out, there were some other problems I ran into in my data, so I had to make the function even more robust. I’ve gone ahead and added a couple other warnings as well. So like one time I didn’t have any tokens of a particular vowel, but plenty of the other. So after applying `droplevels`, only one vowel remained, so then when I ran `table`, the lowest number in that table was indeed higher than 5, but it crashed because there was only one vowel there. So, I added a check to makes sure that the table contained exactly two vowel tokens.

Finally, I found out that sometimes there will be a speaker or a vowel pair that happens to have no data. Zero tokens of either vowel. In theory, the other two checks should handle it, but it turns out the checks themselves crash if there’s zero data. So, I added that first line, `if (nrow(vowel_data) < 1) return(NA)`, before the other checks are done, to ensure that there’s at least something there.

So, the complete function looks like this:

``````bhatt <- function (F1, F2, vowel) {
vowel_data <- data.frame(vowel) %>%
droplevels()

# Make sure there's enough data
if (nrow(vowel_data) < 1) return(NA) # not zero
if (length(table(vowel_data)) < 2) return(NA) # both vowels are represented
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel

}``````

This version of the function is the one that I use in my data. It’s a little more robust than the simpler version and it can safely iterate over whatever groups you want.

``````low_back %>%
group_by(fake_speaker) %>%
summarize(bhatt = bhatt(F1, F2, vowel))``````
``````# A tibble: 2 × 2
fake_speaker bhatt
<chr>        <dbl>
1 Joey         0.892
2 Stanley      0.728``````
``````all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(bhatt = bhatt(F1, F2, vowel))``````
``````# A tibble: 8 × 3
# Groups:   vowel_pair [4]
vowel_pair fake_speaker   bhatt
<chr>      <chr>          <dbl>
1 fail fell  Joey          0.124
2 fail fell  Stanley       0.0830
3 low back   Joey          0.892
4 low back   Stanley       0.728
5 pin pen    Joey          0.291
6 pin pen    Stanley       0.453
7 pull pole  Joey         NA
8 pull pole  Stanley      NA     ``````

So that’s it for the Bhattacharyya’s Affinity (for now). Now let’s more on to `pillai`.

### Trying to make `pillai` more robust

Because of the way `pillai` is implemented right now, particularly with the `...` syntax, it’s a little tricky to add the same data validation checks that we added in the `bhatt` function above. In fact, I spent a bit of time on it, but the main hurdle is that the `manova` function requires things to be in a formula (that is, something like `y ~ x`). So, for now, I can’t offer a perfect solution to the `pillai` function.

Instead, I offer a less-than-ideal solution: two different functions, each with their strengths and weaknesses.

The first is `pillai`, which is exactly how we have it already (repeated here for clarity):

``````pillai <- function(...) {
summary(manova(...))\$stats["vowel","Pillai"]
}``````

The good part about this function is that whatever formula you want to use, works perfectly fine. So if you want to add duration or F3 to the dependent variable matrix, or pack on all sorts of social or phonological factors, go for it and it’ll work as long as `manova` can handle it.

The bad part about this is that we can’t add any additional data validation procedures to account for potentially missing data. So like if I try to run it on the `pull_pole` dataset like I did with `bhatt`, it’ll crash.

``````all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai_score = pillai(cbind(F1, F2) ~ vowel))

## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]):
## contrasts can be applied only to factors with 2 or more levels``````

Now, apparently `manova` can handle less data than `kerneloverlap` can, because the only reason that crashed is because one of my fake speakers had zero tokens of pull. I’ll try it again but pooling all my data together. If there’s even just one token, it’ll actually run and return a value for you.

``````all_pairs %>%
group_by(vowel_pair) %>%
summarize(pillai_score = pillai(cbind(F1, F2) ~ vowel))``````
``````# A tibble: 4 × 2
vowel_pair pillai_score
<chr>             <dbl>
1 fail fell        0.819
2 low back         0.225
3 pin pen          0.696
4 pull pole        0.0687``````

I’ll admit, I’m a little skeptical of a Pillai score where one of the vowels only has one token in it. And sure enough, if we look at the model output, the p-value is high, probably because there’s just not a lot of data.

``summary(manova(cbind(F1, F2) ~ vowel, data = pull_pole))``
``````          Df   Pillai approx F num Df den Df Pr(>F)
vowel      1 0.068656   1.7692      2     48 0.1814
Residuals 49                                       ``````

So, if I could, I would still implement the data validation checks like I did with `bhatt` to ensure there’s enough data before running so that I can feel more confident about the values I’m getting, rather than getting values without any sort of warning message informing me of the potentially bad result.

Well, one solution is to create a separate function that can do the checks for sufficient data. I’ll create one called `pillai2` that actually mimics the syntax used in `bhatt`.

``````pillai2 <- function(F1, F2, vowel) {

vowel_data <- data.frame(vowel) %>%
droplevels()

if (nrow(vowel_data) < 1) return(NA) # not zero
if (length(table(vowel_data)) < 2) return(NA) # both vowels are represented
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel

summary(manova(cbind(F1, F2) ~ vowel))\$stats["vowel","Pillai"]
}``````

Now, when I call `pillai2`, I don’t need to use `cbind` or the formula syntax. Instead, I just give it the name of the columns corresponding to my F1, F2, and vowel columns.

``````all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai2(F1, F2, vowel))``````
``````# A tibble: 8 × 3
# Groups:   vowel_pair [4]
vowel_pair fake_speaker pillai
<chr>      <chr>         <dbl>
1 fail fell  Joey          0.825
2 fail fell  Stanley       0.849
3 low back   Joey          0.117
4 low back   Stanley       0.301
5 pin pen    Joey          0.731
6 pin pen    Stanley       0.708
7 pull pole  Joey         NA
8 pull pole  Stanley      NA    ``````

So, yay, right? The problem with this version of the `pillai2` function is that you can’t control the MANOVA formula. So if you want to run a MANOVA that controls for things like following place of articulation, this function, `pillai2`, won’t be able to do that, unlike the previous version, `pillai`.

My recommendation is that if you want to add additional variables, use the original `pillai` function (not this `pillai2` that we just made) but do the data validation separately.

Switching gears one more time, there’s one more thing we can do to the `pillai` function to make it more robust. Right now, it assumes that the variable in your dataframe is called “vowel”. Sometimes, that might not be the case. I sometimes use “vowel_class” if I’m working with specific subsets of the data. If I run the `pillai` function as it is on a dataframe like that, it’ll crash.

``````all_pairs %>%
rename(vowel_class = vowel) %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai2(F1, F2, vowel))

So, we’ll have to add a little more robustness to the function. You could use the number `1` instead of `"vowel"` because the vowel argument is first in our formula.

``````pillai <- function(...) {
summary(manova(...))\$stats[1,"Pillai"]
}
pillai(cbind(F1, F2, dur) ~ vowel + plt_place + plt_manner + plt_voice, data = low_back)``````
``[1] 0.2767855``

But that assumes the vowel argument is first, which it might not always be. Here, I’ll put the `plt_place` variable first after the tilde, and you’ll notice that the result changes.

``````pillai <- function(...) {
summary(manova(...))\$stats[1,"Pillai"]
}
pillai(cbind(F1, F2, dur) ~ plt_place + plt_manner + plt_voice + vowel, data = low_back)``````
``[1] 0.3752574``

The solution here is to either make sure the `vowel` column is first in your function or to rename whatever it is that your column is called to “vowel.” I don’t like either of those solutions personally but, yet again, I don’t really have a good solution. Using the number `1` might be slightly better than changing the column name because it doesn’t force you to change the underlying structure of your dataframe.

However, if you like the `pillai2` function more—the one that has some robustness already built into it but is limited to just one independent variable—you’re in luck because it actually can handle different column names already. Regardless of what my vowel column is called, when it gets passed to the function, it is temporarily renamed `"vowel"`. So, this will still work:

``````all_pairs %>%
rename(this_is_a_random_long_name = vowel) %>%
group_by(vowel_pair) %>%
summarize(pillai = pillai2(F1, F2, this_is_a_random_long_name))``````
``````# A tibble: 4 × 2
vowel_pair pillai
<chr>       <dbl>
1 fail fell   0.819
2 low back    0.225
3 pin pen     0.696
4 pull pole  NA    ``````

So again, there’s a toss up between `pillai`, which can handle any MANOVA function, and `pillai2` which is less error prone. Perhaps in the future I’ll be able to find a way to combine both into one super robust `pillai` function. For now, you know as much as I do.

## Miscellaneous material

At this point, we’re done; I won’t be playing around with the functions anymore. Instead, in this last section, I’ll look at some fun tricks when using them and how to reshape the output to be more useful for you. Again, parts of this section get a little tedious and use some tricksy R stuff.

### Combining functions

Sometimes, it’s a little bit silly to have both the `pillai2` function and the `bhatt` function when they’re mostly similar. So, we can actually combine them into one function, if that makes sense to you. We’ll create a new function called `overlap`. The first part is the same as both of the other functions.

This section only works if you want to use `pillai2`. If you want to be flexible in the MANOVA formula, you’re stuck with `pillai` and you can’t really combine them like this.

The difference is I create a new argument called `method` and by default it’s `"pillai"`. I then do an `if` statement that basically does different things depending on what is in that `method` argument. If it’s `"pillai"`, then it’ll return the Pillai score. If it’s `"BA"` it’ll do the Bhattacharyya’s Affinity. If it’s something else, it’ll give a little warning message saying that it was an improper method and will return `NA`.

``````overlap <- function(F1, F2, vowel, method = "pillai") {

vowel_data <- data.frame(vowel) %>%
droplevels()

if (nrow(vowel_data) < 1) return(NA) # not zero
if (length(table(vowel_data)) < 2) return(NA) # both vowels are represented
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel

if (method == "pillai") {
summary(manova(cbind(F1, F2) ~ vowel))\$stats["vowel", "Pillai"]
} else if (method == "BA") {
} else {
warning("Improper method")
return(NA)
}
}``````

The result is that we just have one function, `overlap`, that takes care of both measurements.

``````all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = overlap(F1, F2, vowel),
bhatt  = overlap(F1, F2, vowel, method = "BA"))``````
``````# A tibble: 8 × 4
# Groups:   vowel_pair [4]
vowel_pair fake_speaker pillai   bhatt
<chr>      <chr>         <dbl>   <dbl>
1 fail fell  Joey          0.825  0.124
2 fail fell  Stanley       0.849  0.0830
3 low back   Joey          0.117  0.892
4 low back   Stanley       0.301  0.728
5 pin pen    Joey          0.731  0.291
6 pin pen    Stanley       0.708  0.453
7 pull pole  Joey         NA     NA
8 pull pole  Stanley      NA     NA     ``````

This is kinda cool I guess. If you like that better, go for it. But if you like the `bhatt` and `pillai2` functions better, use those instead.

Now, if you wanted to get really fancy, you could have the best of both worlds. You could retain the `pillai2` and `bhatt` functions for the user, but under the hood they actually do the same thing.

``````pillai2 <- function(...) {
overlap(..., method = "pillai")
}
bhatt <- function(...) {
overlap(..., method = "BA")
}
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai2(F1, F2, vowel),
bhatt  = bhatt(F1, F2, vowel))``````
``````# A tibble: 8 × 4
# Groups:   vowel_pair [4]
vowel_pair fake_speaker pillai   bhatt
<chr>      <chr>         <dbl>   <dbl>
1 fail fell  Joey          0.825  0.124
2 fail fell  Stanley       0.849  0.0830
3 low back   Joey          0.117  0.892
4 low back   Stanley       0.301  0.728
5 pin pen    Joey          0.731  0.291
6 pin pen    Stanley       0.708  0.453
7 pull pole  Joey         NA     NA
8 pull pole  Stanley      NA     NA     ``````

If I were to put together an R package with these functions, I’d do something like that. It’s easiest to maintain while still being user-friendly.

Incidentally, if you want to help me create an R package with all this stuff, let’s talk.

## Reshape and visualize

The last thing I’ll do in this lengthy tutorial is to look at how make a couple visualizations for pillai scores and how to transform your data along the way. But, this is best illustrated with more speakers, so I’m going to do take my data and duplicate it a bunch of times to simulate the effect of a larger sample. I’ll then randomly divide that data up into eight fake speakers.

``````fake_data <- rbind(all_pairs, all_pairs, all_pairs, all_pairs, all_pairs, all_pairs, all_pairs, all_pairs) %>%
filter(vowel_pair != "pull pole") %>%
mutate(fake_speaker = sample(c("Andy", "Betty", "Chuck", "Dolly", "Earl", "Flo", "Gary", "Helga"), nrow(.), replace = TRUE))
sample_n(fake_data, 10)``````
``````   vowel_pair vowel        word  dur    F1     F2 fol_seg plt_manner
1    low back    AA   DOCUMENTS 0.10 623.6 1200.6       K       stop
2     pin pen    EH  IDENTIFIED 0.05 489.4 1707.0       N      nasal
3    low back    AA         FOX 0.07 633.4 1102.7       K       stop
4     pin pen    EH DEMONSTRATE 0.05 420.2 1549.5       M      nasal
5     pin pen    IH     WINNING 0.05 377.2 1678.3       N      nasal
6    low back    AA        BODY 0.11 682.6 1103.2       D       stop
7     pin pen    EH      MENTAL 0.06 570.5 1678.6       N      nasal
8    low back    AA  IMPOSSIBLE 0.09 576.5 1042.1       S  fricative
9    low back    AO       AWFUL 0.13 624.2 1101.2       F  fricative
10   low back    AA      SOCCER 0.11 651.3 1127.5       K       stop
plt_place plt_voice fake_speaker
1        velar voiceless         Earl
2       apical    voiced        Chuck
3        velar voiceless         Andy
4       labial    voiced         Earl
5       apical    voiced         Earl
6       apical    voiced         Earl
7       apical    voiced        Helga
8       apical voiceless         Andy
9  labiodental voiceless         Andy
10       velar voiceless         Earl``````

When you run the `pillai` and `bhatt` functions, the results is a table such that each combination of speaker and vowel pair are on their own unique row.

``````pillai_table <- fake_data %>%
group_by(fake_speaker, vowel_pair) %>%
summarize(pillai = pillai2(F1, F2, vowel))
pillai_table``````
``````# A tibble: 24 × 3
# Groups:   fake_speaker [8]
fake_speaker vowel_pair pillai
<chr>        <chr>       <dbl>
1 Andy         fail fell   0.805
2 Andy         low back    0.265
3 Andy         pin pen     0.712
4 Betty        fail fell   0.883
5 Betty        low back    0.188
6 Betty        pin pen     0.622
7 Chuck        fail fell   0.828
8 Chuck        low back    0.236
9 Chuck        pin pen     0.645
10 Dolly        fail fell   0.801
# ℹ 14 more rows``````

This structure may be handy for some types of visualizations. For example, you can make a plot like this which shows the spread of Pillai scores across each vowel pair. I’ve underlaid a violin plot so you can see the distribution a little better.

``````ggplot(pillai_table, aes(vowel_pair, pillai)) +
geom_violin(color = "gray75") +
geom_text(aes(label = fake_speaker))``````

But you may need to rearrange your data in a different way. For example, you may want just one speaker per row and each vowel pair in its own column. Fortunately, the `tidyr` package has a really handy function called `spread` that’ll do this. There are two arguments to `spread`:

I explain this `spread` function in a little more detail in my second tutorial on plotting formant data.
1. The name of the column whose values you want to spread out into their own columns. In our case, the `vowel_pair` column contains the four different pairs.

2. The name of the existing columns whose values you want to be contained in the new columns. In our case, the `pillai` column contains all those numbers and we want those numbers to fill the cells of the new columns being created.

So, if we put it together, this is what we get.

``````pillai_table_wide <- pillai_table %>%
pillai_table_wide``````
``````# A tibble: 8 × 4
# Groups:   fake_speaker [8]
fake_speaker `fail fell` `low back` `pin pen`
<chr>              <dbl>      <dbl>     <dbl>
1 Andy               0.805      0.265     0.712
2 Betty              0.883      0.188     0.622
3 Chuck              0.828      0.236     0.645
4 Dolly              0.801      0.282     0.796
5 Earl               0.828      0.284     0.769
6 Flo                0.849      0.229     0.663
7 Gary               0.827      0.194     0.668
8 Helga              0.803      0.244     0.709``````

Now, with our data reshaped in this way, we could do a scatterplot to see the correlation between two mergers:

There are ticks around `low back` and `pin pen` because those are the column names themselves and when you refer to column names that have spaces in them you have to surround them with ticks. These show up in the plot itself, but you can fix that with `labs`.
``````ggplot(pillai_table_wide, aes(`low back`, `pin pen`)) +
geom_text(aes(label = fake_speaker))``````

Because I’ve randomly split by data up, there’s really no pattern, but if you suspect a correlation between mergers in your dataset, this plot might be more informative for you.

Another plot you may want to see is the correlation between the Pillai score and the Bhattacharyya’s Affinity. We’ll calculate both of them at once and create an `overlap_table` object.

``````overlap_table <- fake_data %>%
group_by(fake_speaker, vowel_pair) %>%
summarize(pillai = pillai2(F1, F2, vowel),
bhatt  = bhatt(F1, F2, vowel))
overlap_table``````
``````# A tibble: 24 × 4
# Groups:   fake_speaker [8]
fake_speaker vowel_pair pillai  bhatt
<chr>        <chr>       <dbl>  <dbl>
1 Andy         fail fell   0.805 0.118
2 Andy         low back    0.265 0.802
3 Andy         pin pen     0.712 0.389
4 Betty        fail fell   0.883 0.0817
5 Betty        low back    0.188 0.831
6 Betty        pin pen     0.622 0.385
7 Chuck        fail fell   0.828 0.119
8 Chuck        low back    0.236 0.827
9 Chuck        pin pen     0.645 0.424
10 Dolly        fail fell   0.801 0.128
# ℹ 14 more rows``````

Now, with this table, we can do another scatterplot, but pitting the two measurements against each other. We’d expect a negative correlation since overlap in Pillai means a 0 and in Bhattacharyya’s Affinity it means 1. But can we find anything that stands out? Here, I’ll plot all the data, but color the names by vowel pair.

``````ggplot(overlap_table, aes(pillai, bhatt, color = vowel_pair)) +
geom_segment(x = 1, y = 0, xend = 0, yend = 1, color = "gray50") +
geom_text(aes(label = fake_speaker))``````

So this is kind of cool. For the low back merger, the distribution is tight and there’s not a lot going on. For the other two, they’re more spread out. We don’t see too many points straying too far from the line, so there’s not a lot of concern here. Maybe because it’s just a bunch of random samples of my own data it’s not particularly enlightening. Perhaps if you try this on your own data, you might find some interesting differences between the two measurements, which will require some digging.

## Conclusion

So that’s it. In this post I covered more of the nitty-gritty detail on a couple topics relating to getting vowel overlap measurements in R. As you can tell, there are some residual problems, like coming up with one function that is flexible enough to allow for any MANOVA formula but still allow for data validation. But the once those formulas are written, you can do some pretty cool stuff in just a couple lines of code. Plus, the visuals are straightforward to implement. Hopefully, thanks to this tutorial, the coding itself is no longer a barrier for you if you’ve been meaning to get vowel overlap measurements for your data.