Vowel overlap in R: More advanced topics
This is a continuation of my previous tutorial on how to calculate Pillai scores and Bhattacharyya’s Affinity in R for the purposes of measuring vowel overlap. As I was putting that one together, it occurred to me that I had a lot of things to say and the tutorial got really long and complicated. So I moved all the more advanced topics here to keep the main one a little lighter and more approachable.
In this post, I’ll cover what to do if you have multiple vowel pairs you want to measure in each speaker, errors you may encounter with the select function when you’re calculating Bhattacharyya’s Affinity, ways of making the functions less error-prone, and some visualizations you can do with your data after you’ve collected it.
Update (November 23, 2021): Betsy Sneller and I have done some recent research on Pillai scores. Please see the summary of our ASA2021 poster (and the poster itself) for more information.
Data prep
The data prep for this post is covered in Part 1 already, so I’ll put the code here without further comment.
library(tidyverse)
my_vowels <- read.csv("../data/joey.csv") %>%
filter(stress == 1) %>%
select(vowel, word, dur, F1.50., F2.50., fol_seg, plt_manner, plt_place, plt_voice) %>%
rename(F1 = F1.50., F2 = F2.50.) %>%
mutate(fake_speaker = sample(c("Joey", "Stanley"), nrow(.), replace = TRUE))
low_back <- my_vowels %>%
filter(vowel %in% c("AA", "AO"),
!fol_seg %in% c("L", "R"),
word != "ON")
So where I left off at the last tutorial was a function that can calculate the Pillai score and another for the Bhattacharyya’s Affinity, each used with the help of summarize:
pillai <- function(...) {
summary(manova(...))$stats["vowel","Pillai"]
}
bhatt <- function (F1, F2, vowel) {
vowel_data <- droplevels(data.frame(vowel))
sp_df <- SpatialPointsDataFrame(cbind(F1, F2), vowel_data)
kerneloverlap(sp_df, method='BA')[1,2]
}
low_back %>%
summarize(low_back_pillai = pillai(cbind(F1, F2) ~ vowel),
low_back_bhatt = bhatt(F1, F2, vowel))
## low_back_pillai low_back_bhatt
## 1 0.03258069 0.9208103
Okay cool. Let’s see what else can be done with this.
Multiple vowel pairs
What I’ve shown so far is how to calculate the Pillai score for multiple speakers for a single pair of vowels. The next question is how to get the Pillai score for multiple speakers for multiple pairs of vowels.
Unfortunately, I don’t really know of a quick way to do that. The problem is that a single vowel might be used in multiple pairs (like EH, which shows up in both the pin-pen and fail-fell pairs below), so there’s no simple way to group a single dataframe by vowel pair.
The easiest solution I can think of is to define separate datasets, like we’ve done with low_back, and run the same code on each. That way, you have all the flexibility of defining specific allophones for each question. And you probably won’t be getting the Pillai score on too many pairs of vowels, right?
Let’s make two more subsets. First, I’ll look at the pin-pen merger, which I don’t really have but which is common in the South, where I live right now. (I’ll also remove some stop words from the dataset.) I’ll also look at the fail-fell merger, which I also don’t really have but which can be found in places like Utah and parts of Texas. I’ll create these subsets and plot them so you can see what they look like in my own speech. Maybe you can roughly estimate the Pillai scores.
pin_pen <- my_vowels %>%
filter(vowel %in% c("IH", "EH"), fol_seg %in% c("M", "N"), !word %in% c("IN", "INTO"))
head(pin_pen)
## vowel word dur F1 F2 fol_seg plt_manner plt_place plt_voice fake_speaker
## 1 IH INJURY 0.08 394.6 2242.6 N nasal apical voiced Stanley
## 2 EH MEANT 0.06 448.8 1631.9 N nasal apical voiced Joey
## 3 EH MENTAL 0.06 570.5 1678.6 N nasal apical voiced Stanley
## 4 EH FENCE 0.07 364.9 1575.5 N nasal apical voiced Stanley
## 5 EH ENTER 0.12 482.7 1738.0 N nasal apical voiced Joey
## 6 EH AM 0.13 583.7 1641.7 M nasal labial voiced Stanley
ggplot(pin_pen, aes(F2, F1, color = vowel, label = word)) +
geom_text(size = 3) +
scale_x_reverse() + scale_y_reverse()
fail_fell <- my_vowels %>%
filter(vowel %in% c("EY", "EH"), fol_seg =="L")
head(fail_fell)
## vowel word dur F1 F2 fol_seg plt_manner plt_place plt_voice fake_speaker
## 1 EY AILMENTS 0.06 387.3 2149.5 L lateral apical voiced Stanley
## 2 EY SAILOR 0.08 382.5 1898.5 L lateral apical voiced Joey
## 3 EH CELLPHONE 0.09 509.4 1549.1 L lateral apical voiced Joey
## 4 EY SURVEILLANCE 0.09 468.9 2104.8 L lateral apical voiced Joey
## 5 EY TAYLOR 0.05 311.7 2069.7 L lateral apical voiced Joey
## 6 EH ELSE 0.06 583.1 1343.6 L lateral apical voiced Stanley
ggplot(fail_fell, aes(F2, F1, color = vowel, label = word)) +
geom_text(size = 3) +
scale_x_reverse() + scale_y_reverse()
Okay, so now we’re ready to calculate Pillai scores for these new pairs. Because we’ve created the pillai function, we can relatively easily apply it to each of these datasets.
pin_pen %>%
summarize(pin_pen_pillai = pillai(cbind(F1, F2) ~ vowel))
## pin_pen_pillai
## 1 0.2861066
fail_fell %>%
summarize(fail_fell_pillai = pillai(cbind(F1, F2) ~ vowel))
## fail_fell_pillai
## 1 0.7657014
So this is pretty cool. Here you can see that my fail-fell vowel classes are quite distinct, with a Pillai score of 0.77. Meanwhile, my pin-pen vowels are a bit more overlapped, with a Pillai score of 0.29, though not nearly as overlapped as the low back vowels.
So we’ve run the same thing three times on three different datasets. But can we do this even more elegantly? Sure!
Okay, so first, I’ll combine all three datasets (low_back, pin_pen, and fail_fell) into one combined dataframe using bind_rows. But in order to “keep track” of which dataframe a particular row came from, I’ll create a new column called vowel_pair using the .id = "vowel_pair" argument. To get this to work, I’ll “name” each of the dataframes as I combine it. So, for example, the low_back dataframe is called "low back", and I show that by putting the new name between ticks (that little thing next to the number 1 key on my keyboard that looks like this: `). If you combine the dataframes in this way, you’ll have a new column saying which one each row came from.
all_pairs <- bind_rows(`low back` = low_back,
`pin pen` = pin_pen,
`fail fell` = fail_fell,
.id = "vowel_pair")
## vowel_pair vowel word dur F1 F2 fol_seg plt_manner plt_place plt_voice fake_speaker
## 1 low back AA TODD 0.17 614.6 1065.2 D stop apical voiced Stanley
## 2 low back AA GOD 0.09 554.0 1250.3 D stop apical voiced Joey
## 3 low back AA OPERATOR 0.09 917.0 1273.3 P stop labial voiceless Joey
## 4 low back AA FATHER 0.12 598.6 1000.2 DH fricative interdental voiced Joey
## 5 low back AO WATER 0.05 587.5 986.7 T stop apical voiceless Joey
## 6 low back AA LOT 0.14 770.0 1068.7 T stop apical voiceless Stanley
## ... ... ... ... ... ... ... ... ... ... ... ...
## 248 fail fell EY GAIL 0.12 431.1 1908.6 L lateral apical voiced Joey
## 249 fail fell EY AILMENTS 0.12 438.0 2048.3 L lateral apical voiced Stanley
## 250 fail fell EY SAIL 0.06 423.7 1797.2 L lateral apical voiced Joey
## 251 fail fell EH FELL 0.05 502.3 1539.2 L lateral apical voiced Joey
## 252 fail fell EY WAILING 0.12 421.3 2086.7 L lateral apical voiced Stanley
## 253 fail fell EY STALE 0.14 437.9 1862.4 L lateral apical voiced Stanley
Now, that may seem like more work than it’s worth, but the payoff is that when you actually go to calculate the Pillai scores, it’s hardly any more complicated than before. In the group_by function, just specify that you want to group the data by the vowel pair and by the speaker, and it’ll magically do the rest for you.
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai(cbind(F1, F2) ~ vowel))
## # A tibble: 6 x 3
## # Groups: vowel_pair [3]
## vowel_pair fake_speaker pillai
## <chr> <chr> <dbl>
## 1 fail fell Joey 0.793
## 2 fail fell Stanley 0.732
## 3 low back Joey 0.0755
## 4 low back Stanley 0.0144
## 5 pin pen Joey 0.172
## 6 pin pen Stanley 0.573
In my opinion, it’s worth it to do the extra bit of work beforehand prepping your data with the whole bind_rows business and creating the new function, because the benefit is that when it actually comes time to calculate the Pillai score, you’ve got a tidy dataset to work with and a flexible function to use.
The select clash and Bhattacharyya’s Affinity
In Dan Johnson’s NWAV presentation where he introduces Bhattacharyya’s Affinity, he installs and loads an additional package, sp, so that the SpatialPointsDataFrame function can be used. This is technically not necessary to do explicitly, because sp is a dependency of adehabitatHR, meaning that when you install and load adehabitatHR, you bring sp along too.
Okay, in Part 1, I talked about the library needed to run Bhattacharyya’s Affinity, adehabitatHR. However, it always seemed to be the case that my tidyverse code, particularly the select function, broke in scripts that used this package. It was super annoying, and for the longest time I couldn’t figure out why.
As it turns out, when you load the adehabitatHR package, it also loads the MASS package. Unfortunately for us, there’s a function in MASS called select. Well, dplyr also has a select function. Having two select functions loaded at the same time creates a clash, so R chooses the one that was loaded most recently, which is MASS’s (via adehabitatHR).
You can actually see this in a warning when you load adehabitatHR if dplyr (via tidyverse) is already loaded:
To replicate the messages and output in this section you’ll have to restart R and load the packages again from scratch in this order. To do that, go to Session→Restart R.
library(tidyverse)
library(adehabitatHR)
## ...output truncated...
## Loading required package: MASS
## Attaching package: ‘MASS’
## The following object is masked from ‘package:dplyr’:
## select
## ...output truncated...
And when you go to use select, R thinks it should run MASS::select instead of dplyr::select:
low_back %>%
select(vowel)
## Error in select(., vowel) : unused argument (vowel)
So, there are two solutions to this. First, you can simply take advantage of the fact that the most recently loaded package takes priority and load adehabitatHR before you load tidyverse:
library(adehabitatHR)
library(tidyverse)
low_back %>%
select(vowel)
## vowel
## 1 AA
## 2 AA
## 3 AA
## 4 AA
## 5 AO
## 6 AA
That’ll ensure that dplyr::select takes precedence. However, I’ve never really liked this for a couple reasons. One, I’m stubborn: I like to load my packages in a specific order (basically the order that I use them), and at the top of my code it seems weird to load a package before tidyverse. Also, it means that if I’m already going along in an R session with tidyverse loaded and I then want to do some Bhattacharyya’s Affinity, I’ll need to either unload the library or restart R, which is never really all that convenient.
The other option is to not actually load the adehabitatHR package at all. You can still use functions from a package that isn’t loaded (as long as it’s installed on your computer) by prefacing the function name with the name of the package and a pair of colons. So when it comes time to use the one function I need from adehabitatHR, the kerneloverlap function, I’ll just type adehabitatHR::kerneloverlap. In my opinion, this is the better route, just because I don’t like loading an entire package (and reworking lots of other things in my code) to make use of one function.
For the remainder of this tutorial, I’m going to implement this second option. For consistency, I’ll also not load the sp package and will call SpatialPointsDataFrame with sp:: prefixed before it.
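Just to make the idea concrete, here’s what those fully qualified calls look like outside of a function (a quick sketch on the low back data, mirroring what the bhatt function does internally):
# Calling the two functions with :: instead of loading their packages (sketch)
sp_df <- sp::SpatialPointsDataFrame(cbind(low_back$F1, low_back$F2),
data.frame(vowel = low_back$vowel))
adehabitatHR::kerneloverlap(sp_df, method = "BA")[1,2]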
Making the functions more robust
In this section, I’ll go into some more detail about how to make your functions less likely to crash. Some of the topics here get tedious and use some more advanced R skills. If you’ve been finding that your code breaks when you apply it to a bunch of speakers, this might help with those issues.
I’ll start with the bhatt function because it’s a little more straightforward. Then we’ll get into some slightly more confusing stuff with pillai.
Making bhatt more robust
I’ve noticed when running Bhattacharyya’s Affinity on my data that it tends to crash if certain conditions aren’t met. For example, the calculation requires at least five observations from each vowel class to work. So, let’s say I wanted to look at the pull-pole merger, that is, the merger of UH and OW before /l/. (I think I have this merger, but I’m really not sure. I thought studying it in my data would help my own intuitions, but now I’m always hyper-aware of the relatively small group of relevant words. But I digress…)
pull_pole <- my_vowels %>%
filter(vowel %in% c("UH", "OW"), fol_seg %in% c("L")) %>%
droplevels()
head(pull_pole)
## vowel word dur F1 F2 fol_seg plt_manner plt_place plt_voice fake_speaker
## 1 OW BOLD 0.13 462.6 707.9 L lateral apical voiced Stanley
## 2 OW GOAL 0.22 366.7 827.0 L lateral apical voiced Joey
## 3 OW COLD 0.06 546.0 892.7 L lateral apical voiced Stanley
## 4 OW POLO 0.19 418.2 893.9 L lateral apical voiced Joey
## 5 OW COLD 0.11 349.7 667.2 L lateral apical voiced Joey
## 6 OW ROLLS 0.11 385.7 785.1 L lateral apical voiced Stanley
But when we run it…
pull_pole %>%
summarize(pull_pole_bhatt = bhatt(F1, F2, vowel))
## Error in kernelUD(xy, same4all = TRUE, ...): At least 5 relocations
## are required to fit an home range
It crashes. You can see in the error message that it says you need at least five “relocations” per “home range”. You can tell this package was intended for animal location data instead of vowel data. But we can see that there aren’t five tokens of each vowel in this subset:
table(pull_pole$vowel)
## OW UH
## 75 1
There’s just one token of UH. This is problematic because if I’m going to apply this function to like 30 speakers, even if one speaker has insufficient data, I want it to work on everyone else without crashing.
So, how can I make the bhatt function a little more robust? Expert R users might know of fancier ways to do what I’m about to do, but I just put a little if statement in there: if the lowest count in that table is less than five, then return (meaning abort the function) with the value NA.
bhatt <- function (F1, F2, vowel) {
vowel_data <- data.frame(vowel) %>%
droplevels()
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel
adehabitatHR::kerneloverlap(sp::SpatialPointsDataFrame(cbind(F1, F2), vowel_data), method='BA')[2,1]
}
pull_pole %>%
summarize(pull_pole_bhatt = bhatt(F1, F2, vowel))
## pull_pole_bhatt
## 1 NA
Now, when I run it, it won’t crash anymore. It won’t return a number, instead returning NA, but at least it doesn’t crash. You can appreciate how handy this might be when we combine all the data together and then run the function on all vowel pairs.
all_pairs <- bind_rows(`low back` = low_back,
`pin pen` = pin_pen,
`fail fell` = fail_fell,
`pull pole` = pull_pole,
.id = "vowel_pair")
all_pairs %>%
group_by(vowel_pair) %>%
summarize(bhatt = bhatt(F1, F2, vowel))
## # A tibble: 4 x 2
## vowel_pair bhatt
## <chr> <dbl>
## 1 fail fell 0.262
## 2 low back 0.921
## 3 pin pen 0.790
## 4 pull pole NA
Now, the function will work just fine on the vowel pairs with enough data and will quietly return NA if there’s not enough data.
As it turns out, there were some other problems I ran into in my data, so I had to make the function even more robust, and I’ve gone ahead and added a couple other checks as well. For example, one time I didn’t have any tokens of a particular vowel, but plenty of the other. After applying droplevels, only one vowel remained, so when I ran table, the lowest number in that table was indeed higher than 5, but the function crashed because only one vowel was there. So, I added a check to make sure that the table contained exactly two vowels.
Finally, I found out that sometimes there will be a speaker or a vowel pair that happens to have no data at all: zero tokens of either vowel. In theory, the other two checks should handle it, but it turns out they themselves crash if there’s zero data. So, I added that first line, if (nrow(vowel_data) < 1) return(NA), before the other checks are done, to ensure that there’s at least something there.
So, the complete function looks like this:
bhatt <- function (F1, F2, vowel) {
vowel_data <- data.frame(vowel) %>%
droplevels()
# Make sure there's enough data
if (nrow(vowel_data) < 1) return(NA) # not zero
if (length(table(vowel_data)) < 2) return(NA) # both vowels are represented
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel
adehabitatHR::kerneloverlap(sp::SpatialPointsDataFrame(cbind(F1, F2), vowel_data), method='BA')[2,1]
}
This version of the function is the one that I use in my data. It’s a little more robust than the simpler version and it can safely iterate over whatever groups you want.
low_back %>%
group_by(fake_speaker) %>%
summarize(bhatt = bhatt(F1, F2, vowel))
## # A tibble: 2 x 2
## fake_speaker bhatt
## <chr> <dbl>
## 1 Joey 0.952
## 2 Stanley 0.827
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(bhatt = bhatt(F1, F2, vowel))
## # A tibble: 8 x 3
## # Groups: vowel_pair [4]
## vowel_pair fake_speaker bhatt
## <chr> <chr> <dbl>
## 1 fail fell Joey 0.166
## 2 fail fell Stanley 0.328
## 3 low back Joey 0.952
## 4 low back Stanley 0.827
## 5 pin pen Joey 0.727
## 6 pin pen Stanley 0.826
## 7 pull pole Joey NA
## 8 pull pole Stanley NA
So that’s it for the Bhattacharyya’s Affinity (for now). Now let’s move on to pillai.
Trying to make pillai more robust
Because of the way pillai is implemented right now, particularly with the ... syntax, it’s a little tricky to add the same data validation checks that we added to the bhatt function above. In fact, I spent a bit of time on it, but the main hurdle is that the manova function requires things to be in a formula (that is, something like y ~ x). So, for now, I can’t offer a perfect solution to the pillai function.
Instead, I offer a less-than-ideal solution: two different functions, each with its own strengths and weaknesses.
The first is pillai, which is exactly how we have it already (repeated here for clarity):
pillai <- function(...) {
summary(manova(...))$stats["vowel","Pillai"]
}
The good part about this function is that whatever formula you want to use works perfectly fine. So if you want to add duration or F3 to the dependent variable matrix, or pack on all sorts of social or phonological factors, go for it; it’ll work as long as manova can handle it.
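For example, adding duration to the response matrix is just a matter of changing the formula (a quick sketch using the dur column that’s already in the data):
# Pillai score with duration added to the dependent variable matrix (sketch)
low_back %>%
summarize(pillai = pillai(cbind(F1, F2, dur) ~ vowel))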
The bad part about this is that we can’t add any additional data validation procedures to account for potentially missing data. So if I try to run it on data containing the pull_pole subset, like I did with bhatt, it’ll crash.
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai_score = pillai(cbind(F1, F2) ~ vowel))
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]):
## contrasts can be applied only to factors with 2 or more levels
Again, this shows that I don’t fully understand the math behind the MANOVA test. I wish I did…
Now, apparently manova can handle less data than kerneloverlap can, because the only reason that crashed is that one of my fake speakers had zero tokens of one of the vowels. If I group by vowel pair alone, it runs fine:
all_pairs %>%
group_by(vowel_pair) %>%
summarize(pillai_score = pillai(cbind(F1, F2) ~ vowel))
## # A tibble: 4 x 2
## vowel_pair pillai_score
## <chr> <dbl>
## 1 fail fell 0.766
## 2 low back 0.0326
## 3 pin pen 0.286
## 4 pull pole 0.0103
I’ll admit, I’m a little skeptical of a Pillai score where one of the vowels only has one token in it. And sure enough, if we look at the model output, the p-value is high, probably because there’s just not a lot of data.
summary(manova(cbind(F1, F2) ~ vowel, data = pull_pole))
## Df Pillai approx F num Df den Df Pr(>F)
## vowel 1 0.01025 0.378 2 73 0.6866
## Residuals 74
So, if I could, I would still implement the data validation checks like I did with bhatt, to ensure there’s enough data before running the model. That way I could feel more confident about the values I’m getting, rather than getting values without any sort of warning about a potentially bad result.
Well, one solution is to create a separate function that does the checks for sufficient data. I’ll create one called pillai2 that mimics the syntax used in bhatt.
pillai2 <- function(F1, F2, vowel) {
vowel_data <- data.frame(vowel) %>%
droplevels()
if (nrow(vowel_data) < 1) return(NA) # not zero
if (length(table(vowel_data)) < 2) return(NA) # both vowels are represented
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel
summary(manova(cbind(F1, F2) ~ vowel))$stats["vowel","Pillai"]
}
Now, when I call pillai2, I don’t need to use cbind or the formula syntax. Instead, I just give it the names of the columns corresponding to my F1, F2, and vowel columns.
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai2(F1, F2, vowel))
## # A tibble: 8 x 3
## # Groups: vowel_pair [4]
## vowel_pair fake_speaker pillai
## <chr> <chr> <dbl>
## 1 fail fell Joey 0.748
## 2 fail fell Stanley 0.805
## 3 low back Joey 0.0429
## 4 low back Stanley 0.0377
## 5 pin pen Joey 0.353
## 6 pin pen Stanley 0.249
## 7 pull pole Joey NA
## 8 pull pole Stanley NA
So, yay, right? The problem with this version, pillai2, is that you can’t control the MANOVA formula. So if you want to run a MANOVA that controls for things like following place of articulation, pillai2 won’t be able to do that, unlike the previous version, pillai.
My recommendation is that if you want to add additional variables, use the original pillai function (not the pillai2 that we just made) but do the data validation separately.
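What might that separate validation look like? One option is to filter out under-sampled groups before summarizing, reusing the same thresholds as the checks inside bhatt (a sketch; note that bad groups are dropped entirely here rather than returned as NA):
# Keep only groups where both vowels appear at least 5 times each,
# then run the flexible pillai function on what survives (sketch)
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
filter(n_distinct(vowel) == 2, min(table(vowel)) >= 5) %>%
summarize(pillai = pillai(cbind(F1, F2) ~ vowel))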
Switching gears one more time, there’s one more thing we can do to the pillai function to make it more robust. Right now, it assumes that the variable in your dataframe is called “vowel”, both in the formula and in the $stats["vowel", "Pillai"] lookup. Sometimes, that might not be the case. I sometimes use “vowel_class” if I’m working with specific subsets of the data. If I run the functions as they’re written on a dataframe like that, still referring to vowel, they’ll crash.
all_pairs %>%
rename(vowel_class = vowel) %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai2(F1, F2, vowel))
## Error in data.frame(vowel): object 'vowel' not found
So, we’ll have to add a little more robustness to the function. You could use the number 1 instead of "vowel", because the vowel term comes first in our formula.
pillai <- function(...) {
summary(manova(...))$stats[1,"Pillai"]
}
pillai(cbind(F1, F2, dur) ~ vowel + plt_place + plt_manner + plt_voice, data = low_back)
## [1] 0.03541762
But that assumes the vowel term is first, which it might not always be. Here, I’ll put the plt_place variable first after the tilde, and you’ll notice that the result changes.
pillai(cbind(F1, F2, dur) ~ plt_place + plt_manner + plt_voice + vowel, data = low_back)
## [1] 0.1890787
The solution here is to either make sure the vowel term is first in your formula or to rename whatever your column is called to “vowel.” I don’t like either of those solutions personally, but, yet again, I don’t really have a good one. Using the number 1 might be slightly better than changing the column name because it doesn’t force you to change the underlying structure of your dataframe.
However, if you like the pillai2 function more (the one that has some robustness already built in but is limited to just one independent variable), you’re in luck, because it can already handle different column names. Regardless of what my vowel column is called, when it gets passed to the function, it is temporarily renamed “vowel”. So, this will still work:
all_pairs %>%
rename(this_is_a_random_long_name = vowel) %>%
group_by(vowel_pair) %>%
summarize(pillai = pillai2(F1, F2, this_is_a_random_long_name))
## # A tibble: 4 x 2
## vowel_pair pillai
## <chr> <dbl>
## 1 fail fell 0.766
## 2 low back 0.0326
## 3 pin pen 0.286
## 4 pull pole NA
So again, there’s a toss-up between pillai, which can handle any MANOVA formula, and pillai2, which is less error-prone. Perhaps in the future I’ll be able to find a way to combine both into one super robust pillai function. For now, you know as much as I do.
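That said, if you want to experiment, here’s one direction a combined function could take: accept a full formula plus a dataframe, run the sanity checks on the first predictor, and only then fit the MANOVA. This is only a sketch (the pillai3 name is mine, and it assumes the vowel variable is the first term after the tilde):
# A sketch of a combined function: any MANOVA formula plus data validation
pillai3 <- function(formula, data) {
term <- all.vars(formula[[3]])[1] # name of the first predictor
classes <- droplevels(factor(data[[term]]))
if (length(classes) < 1) return(NA) # not zero
if (length(table(classes)) < 2) return(NA) # both vowels are represented
if (min(table(classes)) < 5) return(NA) # there are at least 5 of each vowel
summary(manova(formula, data = data))$stats[1,"Pillai"]
}
pillai3(cbind(F1, F2) ~ vowel, data = low_back)
pillai3(cbind(F1, F2) ~ vowel, data = pull_pole) # NA instead of crashing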
Miscellaneous material
At this point, we’re done; I won’t be playing around with the functions anymore. Instead, in this last section, I’ll look at some fun tricks when using them and how to reshape the output to be more useful for you. Again, parts of this section get a little tedious and use some tricksy R stuff.
Combining functions
This section only works if you want to use pillai2
. If you want to be flexible in the MANOVA formula, you’re stuck with pillai
and you can’t really combine them like this.
Sometimes, it’s a little bit silly to have both the pillai2 function and the bhatt function when they’re mostly similar. So, we can actually combine them into one function, if that makes sense to you. We’ll create a new function called overlap. The first part is the same as in both of the other functions.
The difference is that I create a new argument called method, which is "pillai" by default. I then write an if statement that does different things depending on what’s in that method argument. If it’s "pillai", it’ll return the Pillai score. If it’s "BA", it’ll do the Bhattacharyya’s Affinity. If it’s something else, it’ll give a little warning message saying that it was an improper method and will return NA.
overlap <- function(F1, F2, vowel, method = "pillai") {
vowel_data <- data.frame(vowel) %>%
droplevels()
if (nrow(vowel_data) < 1) return(NA) # not zero
if (length(table(vowel_data)) < 2) return(NA) # both vowels are represented
if (min(table(vowel_data)) < 5) return(NA) # there are at least 5 of each vowel
if (method == "pillai") {
summary(manova(cbind(F1, F2) ~ vowel))$stats["vowel", "Pillai"]
} else if (method == "BA") {
adehabitatHR::kerneloverlap(sp::SpatialPointsDataFrame(cbind(F1, F2), vowel_data), method='BA')[2,1]
} else {
warning("Improper method")
return(NA)
}
}
The result is that we just have one function, overlap, that takes care of both measurements.
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = overlap(F1, F2, vowel),
bhatt = overlap(F1, F2, vowel, method = "BA"))
## # A tibble: 8 x 4
## # Groups: vowel_pair [4]
## vowel_pair fake_speaker pillai bhatt
## <chr> <chr> <dbl> <dbl>
## 1 fail fell Joey 0.748 0.279
## 2 fail fell Stanley 0.805 0.283
## 3 low back Joey 0.0429 0.892
## 4 low back Stanley 0.0377 0.925
## 5 pin pen Joey 0.353 0.736
## 6 pin pen Stanley 0.249 0.846
## 7 pull pole Joey NA NA
## 8 pull pole Stanley NA NA
This is kinda cool, I guess. If you like that better, go for it. But if you like the separate bhatt and pillai2 functions better, use those instead.
Now, if you wanted to get really fancy, you could have the best of both worlds: retain the pillai2 and bhatt functions for the user, but have them actually do the same thing under the hood.
pillai2 <- function(...) {
overlap(..., method = "pillai")
}
bhatt <- function(...) {
overlap(..., method = "BA")
}
all_pairs %>%
group_by(vowel_pair, fake_speaker) %>%
summarize(pillai = pillai2(F1, F2, vowel),
bhatt = bhatt(F1, F2, vowel))
Incidentally, if you want to help me create an R package with all this stuff, let’s talk. If I were to put together an R package with these functions, I’d do something like that. It’s easiest to maintain while still being user-friendly.
Reshape and visualize
The last thing I’ll do in this lengthy tutorial is look at how to make a couple visualizations of Pillai scores and how to reshape your data along the way. This is best illustrated with more speakers, though, so I’m going to take my data and duplicate it a bunch of times to simulate the effect of a larger sample. I’ll then randomly divide that data up into eight fake speakers.
fake_data <- rbind(all_pairs, all_pairs, all_pairs, all_pairs, all_pairs, all_pairs, all_pairs, all_pairs) %>%
filter(vowel_pair != "pull pole") %>%
mutate(fake_speaker = sample(c("Andy", "Betty", "Chuck", "Dolly", "Earl", "Flo", "Gary", "Helga"), nrow(.), replace = TRUE))
sample_n(fake_data, 10)
## vowel_pair vowel word dur F1 F2 fol_seg plt_manner plt_place plt_voice fake_speaker
## 1 low back AA DOCK 0.16 682.6 1211.6 K stop velar voiceless Betty
## 2 fail fell EH FELL 0.06 417.8 1450.4 L lateral apical voiced Dolly
## 3 pin pen EH DENNIS 0.09 533.9 1532.8 N nasal apical voiced Andy
## 4 pin pen EH ENTRIES 0.08 433.7 1855.1 N nasal apical voiced Betty
## 5 low back AA OPS 0.10 582.2 1152.2 P stop labial voiceless Flo
## 6 low back AA BOX 0.15 584.0 1088.5 K stop velar voiceless Dolly
## 7 low back AA POCKETS 0.13 658.4 1170.9 K stop velar voiceless Andy
## 8 pin pen EH END 0.09 359.4 1916.0 N nasal apical voiced Helga
## 9 low back AA GOT 0.11 674.2 1281.2 T stop apical voiceless Gary
## 10 low back AA COPY 0.10 1129.2 2366.5 P stop labial voiceless Betty
When you run the pillai2 and bhatt functions on this fake data, the result is a table in which each combination of speaker and vowel pair is on its own row.
pillai_table <- fake_data %>%
group_by(fake_speaker, vowel_pair) %>%
summarize(pillai = pillai2(F1, F2, vowel))
pillai_table
## # A tibble: 24 x 3
## # Groups: fake_speaker [8]
## fake_speaker vowel_pair pillai
## <chr> <chr> <dbl>
## 1 Andy fail fell 0.762
## 2 Andy low back 0.0128
## 3 Andy pin pen 0.381
## 4 Betty fail fell 0.660
## 5 Betty low back 0.0438
## 6 Betty pin pen 0.367
## 7 Chuck fail fell 0.862
## 8 Chuck low back 0.0272
## 9 Chuck pin pen 0.322
## 10 Dolly fail fell 0.822
## # … with 14 more rows
This structure may be handy for some types of visualizations. For example, you can make a plot like this which shows the spread of Pillai scores across each vowel pair. I’ve underlaid a violin plot so you can see the distribution a little better.
ggplot(pillai_table, aes(vowel_pair, pillai)) +
geom_violin(color = "gray75") +
geom_text(aes(label = fake_speaker))
But you may need to rearrange your data in a different way. For example, you may want just one speaker per row and each vowel pair in its own column. Fortunately, the tidyr package has a really handy function called spread that’ll do this. (I explain this spread function in a little more detail in my second tutorial on plotting formant data.) There are two arguments to spread:
- The name of the column whose values you want to spread out into their own columns. In our case, the vowel_pair column contains the different pairs.
- The name of the existing column whose values you want to fill the new columns. In our case, the pillai column contains all those numbers, and we want those numbers to fill the cells of the new columns being created.
So, if we put it together, this is what we get.
pillai_table_wide <- pillai_table %>%
spread(vowel_pair, pillai)
pillai_table_wide
## # A tibble: 8 x 4
## fake_speaker `fail fell` `low back` `pin pen`
## <chr> <dbl> <dbl> <dbl>
## 1 Andy 0.762 0.0128 0.381
## 2 Betty 0.660 0.0438 0.367
## 3 Chuck 0.862 0.0272 0.322
## 4 Dolly 0.822 0.0610 0.376
## 5 Earl 0.775 0.0498 0.279
## 6 Flo 0.754 0.0395 0.237
## 7 Gary 0.738 0.0384 0.337
## 8 Helga 0.795 0.0423 0.178
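Incidentally, if you have a newer version of tidyr, pivot_wider does the same reshaping with more transparent argument names (an equivalent sketch):
# The same reshape using the newer pivot_wider (sketch)
pillai_table %>%
pivot_wider(names_from = vowel_pair, values_from = pillai)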
Now, with our data reshaped in this way, we can do a scatterplot to see the correlation between two mergers. (There are ticks around low back and pin pen because those are the column names themselves, and when you refer to column names that have spaces in them, you have to surround them with ticks. These ticks show up in the plot itself, but you can fix that with labs, as shown below.)
ggplot(pillai_table_wide, aes(`low back`, `pin pen`)) +
geom_text(aes(label = fake_speaker))
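For instance, here’s the same plot with the axis titles cleaned up via labs (a sketch; word the labels however you like):
# Same scatterplot, with labs() replacing the backticked default axis titles
ggplot(pillai_table_wide, aes(`low back`, `pin pen`)) +
geom_text(aes(label = fake_speaker)) +
labs(x = "Low back Pillai score", y = "Pin-pen Pillai score")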
Because I’ve randomly split my data up, there’s really no pattern, but if you suspect a correlation between mergers in your dataset, this plot might be more informative for you.
Another plot you may want to see is the correlation between the Pillai score and the Bhattacharyya’s Affinity. We’ll calculate both of them at once and create an overlap_table object.
overlap_table <- fake_data %>%
group_by(fake_speaker, vowel_pair) %>%
summarize(pillai = pillai2(F1, F2, vowel),
bhatt = bhatt(F1, F2, vowel))
overlap_table
## # A tibble: 24 x 4
## # Groups: fake_speaker [8]
## fake_speaker vowel_pair pillai bhatt
## <chr> <chr> <dbl> <dbl>
## 1 Andy fail fell 0.762 0.241
## 2 Andy low back 0.0128 0.866
## 3 Andy pin pen 0.381 0.807
## 4 Betty fail fell 0.660 0.333
## 5 Betty low back 0.0438 0.907
## 6 Betty pin pen 0.367 0.757
## 7 Chuck fail fell 0.862 0.104
## 8 Chuck low back 0.0272 0.898
## 9 Chuck pin pen 0.322 0.788
## 10 Dolly fail fell 0.822 0.145
## # … with 14 more rows
Now, with this table, we can do another scatterplot, but pitting the two measurements against each other. We’d expect a negative correlation, since complete overlap means a Pillai score of 0 but a Bhattacharyya’s Affinity of 1. But can we find anything that stands out? Here, I’ll plot all the data, but color the names by vowel pair.
ggplot(overlap_table, aes(pillai, bhatt, color = vowel_pair)) +
geom_segment(x = 1, y = 0, xend = 0, yend = 1, color = "gray50") +
geom_text(aes(label = fake_speaker))
So this is kind of cool. For the low back merger, the distribution is tight and there’s not a lot going on. For the other two, they’re more spread out. We don’t see too many points straying too far from the line, so there’s not a lot of concern here. Maybe because it’s just a bunch of random samples of my own data it’s not particularly enlightening. Perhaps if you try this on your own data, you might find some interesting differences between the two measurements, which will require some digging.
Conclusion
So that’s it. In this post I covered more of the nitty-gritty details on a couple topics relating to getting vowel overlap measurements in R. As you can tell, there are some residual problems, like coming up with one function that is flexible enough to allow any MANOVA formula but still allows for data validation. But once those functions are written, you can do some pretty cool stuff in just a couple lines of code. Plus, the visuals are straightforward to implement. Hopefully, thanks to this tutorial, the coding itself is no longer a barrier if you’ve been meaning to get vowel overlap measurements for your data.