1 Loading the data & packages

library(tidyverse)
library(broom)

v <- read_csv("tutorial_formants.csv")

In this analysis file we process the output of Polyglot from the formants tutorial (formants_tutorial.csv) and show how it can be used to address (simple) research questions involving variation between speakers in vowel space:

  • Plot the raw data, assess how well the formant measurement algorithm did
  • Filter the data to a subset with reliable measurements
  • Use this subset to make vowel plots (in F1/F2 space) for each speaker
  • Illustrate a couple kinds of interspeaker variation (by dialect, gender)

This analysis is partly adapted from course materials by Márton Sóskuthy (UBC).

2 Data processing

Quick look at columns of the dataframe:

glimpse(v)
## Rows: 74,001
## Columns: 13
## $ speaker     <dbl> 1089, 1089, 1089, 1089, 1089, 1089, 1089, 1089, 1089, 1089…
## $ speaker_sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
## $ file        <chr> "1089-134691-0013", "1089-134691-0011", "1089-134691-0013"…
## $ speech_rate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ word        <chr> "us", "contrite", "him", "all", "only", "becomes", "elabor…
## $ phone       <chr> "AH1", "AY1", "IH1", "AO1", "OW1", "AH1", "ER0", "IH1", "I…
## $ previous    <chr> "D", "R", "HH", "V", "Z", "K", "B", "HH", "T", "D", "S", "…
## $ following   <chr> "S", "T", "M", "L", "N", "M", "EY2", "Z", "W", "V", "L", "…
## $ phone_start <dbl> 7.32, 9.42, 13.15, 6.46, 11.20, 5.67, 18.65, 13.76, 1.15, …
## $ phone_end   <dbl> 7.40, 9.50, 13.22, 6.82, 11.31, 5.74, 18.76, 13.81, 1.33, …
## $ F1          <dbl> 501.02, 653.54, 396.60, 580.47, 549.19, 512.96, 615.42, 33…
## $ F2          <dbl> 1517.61, 1345.15, 1785.44, 899.85, 955.67, 1215.04, 1214.0…
## $ F3          <dbl> 2356.44, 2022.31, 2498.21, 2306.28, 2606.67, 2247.82, 1826…

These mostly look like we’d expect: phone_start and phone_end are times, previous and following are the phones following the vowel (phone), F1/F2/F3 are formants, and o on. speech_rate should not be all NAs; there must be a bug.

Define and correct some columns:

  • Vowel duration
  • stress: this is e.g. 1 in AH1
  • stressed: stress = 0 vs 2/1
  • vowel: this is e.g. AH in AH
  • speaker: should be character (or factor)
v <- v %>% mutate(
  duration = phone_end - phone_start,
  stress = str_split(phone, '[A-Z]+', simplify = TRUE)[,2],
  stressed = ifelse(stress=='0', 'N', 'Y'),
  vowel = str_split(phone, '[0-9]', simplify = TRUE)[,1],
  speaker = as.character(speaker)
)

Let’s also label which vowels are diphtohngs, since we’ll want to just plot monophthongs (= all other vowels). This means vowels that just have one target (“heed”, “head”, but not “hide”), so it makes sense to just show one point for them in a vowel plot.

diphthongs <- c("AY", "AW", 'OY')

v <- v %>% mutate(
  vowel_class = ifelse(vowel %in% diphthongs, 'dip', 'mono')
)

Let’s call that the raw dataset:

v_raw <- v

3 Basic plots + cleaning

Examine vowel duration distributions, just for monophthongs, for stressed vowels. Our goal here is to assess what good cutoffs to exclude vowels that are reduced, or whose durations are so long as to be force-alignment errors:

filter(v, vowel_class == 'mono' & stressed == 'Y') %>% 
  ggplot(aes(y=duration, x=vowel)) +
  # faceting by speaker
  #facet_wrap(.~speaker) +
  # using violins to show distributions per vowel
  geom_violin() +
  # adding the median separately as a point
  stat_summary(fun=median, geom="point", shape=16) + 
  scale_y_log10() + 
  annotation_logticks(sides='l')

We can see that:

  • Vowels are (very) roughly normally distributed in log-space, so the distribution of duration is very left-skewed
  • The distribution of durations differs between tense and lax (AH, EH, IH, UH) vowels.
  • Durations below 0.03 don’t exist. Durations above about 0.25 are uncommon—these are likely forced alignment errors.

To make plots of the formant data, get mean F1 and F2 by speaker/vowel/stress, add back to original dataframe:

v_summ <- v %>% group_by(speaker, speaker_sex, stressed, vowel, vowel_class) %>%
summarise(F1_mean = mean(F1),
  F2_mean = mean(F2),
  F3_mean = mean(F3),
  n = n()
) 
 v <- v_summ %>% left_join(v)

Now let’s plot the raw formant data, by speaker, for stressed monophthongs, showing just points based on 10+ observations:

# setting up plot; separate phones will be shown
# by colour
v %>% filter(stressed == 'Y' & vowel_class=='mono' & n>=10) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
  # faceting by speaker
  facet_wrap(.~speaker, scales='free') +
  # using phone labels in plot
  stat_ellipse(level = 0.67) + 
  geom_point(size=0.5)+
#  geom_label(data = filter(v_summ, stressed == 'Y' & vowel_class=='mono' & n>5), aes(F2_mean, F1_mean, label=vowel)) + 
  # setting up vowel-chart-like F1 / F2 axes
  scale_y_reverse() +
  scale_x_reverse()

It’s hard to see what’s going on across all speakers—let’s just look at a representative few:

# setting up plot; separate phones will be shown
# by colour
some_speakers <- c('1089', '1284', '1580', '4446', '8224')

v %>% filter(speaker %in% some_speakers & stressed == 'Y' & vowel_class=='mono' & n>=10) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
  # faceting by speaker
  facet_wrap(.~speaker, scales='free') +
  # using phone labels in plot
  stat_ellipse(level = 0.67) + 
  geom_point(size=0.5, alpha=0.5)+
  geom_point(data = filter(v_summ, speaker %in% some_speakers & stressed == 'Y' & vowel_class=='mono' & n>5), aes(F2_mean, F1_mean, color=vowel)) + 
  # setting up vowel-chart-like F1 / F2 axes
  scale_y_reverse() +
  scale_x_reverse()

For most speakers we see roughly the expected vowel quadrilateral, but with significant noise. Most clearly, there are some tokens for each speaker far outside the quadrilateral, with measurements that can’t be right.

We’ll try to find a subset of measurements which are more reliable, by doing some filtering:

  1. We’ll exclude vowels with duration of 50 msec or less, to exclude heavily reduced vowels (assuming force-aligned durations are correct), and 250 msec or more, which are likely force-alignment errors.
  2. Remove some obvious errors: anything with F1 higher than 1250.
  3. Remove outliers within categories using a custom function (below). This defines a vowel-ellipse and excludes anything that is more than 3 standard deviations away within this ellipse (SD is formalised as the Mahalanobis distance here, since SD’s don’t straightforwardly apply to multivariate distributions.)
## From Marton Soskuthy, edited
## min_size: if there are fewer than this many observations,
## don't exclude any
mvt_outlier <- function (x, y, level_sd= 3, min_size = 10) {
  if(length(x) >= min_size){
  mat <- as.matrix(cbind(x,y))
  mu <- c(mean(mat[,1]), mean(mat[,2]))
  sigma <- cov(mat)
  out <- mahalanobis(mat, center=mu, cov=sigma) > level_sd
  } else{
    out <-rep(FALSE, length(x))
  }
}
v_filt <- v %>%
  # grouping by vowel / speaker
  group_by(vowel, speaker) %>%
  filter(duration > 0.051  & duration < 0.251) %>%
  # removing obvious outliers
  filter(F1 < 1250,
  !mvt_outlier(F1, F2)
  ) %>%
  ungroup()

How much data did we lose?

nrow(v)
## [1] 74001
nrow(v_filt)
## [1] 43758

Vowel plot with filtered data (not not restricting by number of examples):

# setting up plot; separate phones will be shown
# by colour
v_filt %>% filter(stressed == 'Y' & vowel_class=='mono') %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
  # faceting by speaker
  facet_wrap(.~speaker, scales='free') +
  # using phone labels in plot
  stat_ellipse(level = 0.67) + 
  geom_point(size=0.5)+
#  geom_label(data = filter(v_summ, stressed == 'Y' & vowel_class=='mono' & n>5), aes(F2_mean, F1_mean, label=vowel)) + 
  # setting up vowel-chart-like F1 / F2 axes
  scale_y_reverse() +
  scale_x_reverse()

This looks more sensible, though there are still tokens with issues, and a couple problematic speakers (8455, maybe 672).

Let’s examine the F1/F2 centroids for stressed monophthongs (now not restricting by number of tokens):

v_filt %>% filter(vowel_class == 'mono' & stressed == 'Y') %>%
  group_by(speaker, vowel) %>%
  summarise(F1=mean(F1),
            F2=mean(F2)) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
  facet_wrap(~speaker) +
  geom_text(aes(label=vowel)) +
  scale_y_reverse() +
  scale_x_reverse()

Let’s plot a subset of vowels to point out some features that could reflect dialect variation:

vowel_subset_1 <- c('AA', 'AO', 'AE', 'EY', 'UW', 'IY')
v_filt %>% filter(vowel %in% vowel_subset_1 & stressed == 'Y') %>%
  group_by(speaker, vowel) %>%
  summarise(F1=mean(F1),
            F2=mean(F2)) %>%
ggplot(aes(x=F2, y=F1, col=vowel)) +
  facet_wrap(~speaker) +
  geom_text(aes(label=vowel)) +
  scale_y_reverse() +
  scale_x_reverse()

American English dialects differ in whether AO (“caught”) and AA (“cot”) are merged. It looks like:

  • Speakers 1089, 3576, 5639, etc. may have AO = AA
  • Speakers 1221, 1580, 7021, etc. may have distinct AO and AA.

Vowel space by gender:

v_filt_summ <- v_filt %>% group_by(speaker, speaker_sex, stressed, vowel, vowel_class) %>%
summarise(F1_mean = mean(F1),
  F2_mean = mean(F2),
  F3_mean = mean(F3),
  n = n()
) 

hull_df <- v_filt_summ %>% ungroup() %>% filter(vowel_class=='mono' & stressed=='Y') %>% group_by(speaker) %>% slice(chull(F2_mean, F1_mean))

hull_df %>% ggplot(aes(x=F2_mean, y=F1_mean)) + geom_polygon(aes(color=speaker_sex), fill=NA) + facet_grid(~speaker_sex, labeller='label_both') + scale_x_reverse() + scale_y_reverse() + 
  theme(legend.position = 'none')

We clearly see that vowel space is shifted to lower F2 and F1 on average for male speakers—which is expected, given physiology—but also that female speakers have a “broader” vowel space, on average.

Same thing, by gender and vowel stress:

hull_df_2 <- v_filt_summ %>% ungroup() %>% filter(vowel_class=='mono') %>% group_by(speaker, stressed) %>% slice(chull(F2_mean, F1_mean))

hull_df_2 %>% ggplot(aes(x=F2_mean, y=F1_mean)) + geom_polygon(aes(color=speaker_sex), fill=NA) + facet_grid(stressed~speaker_sex, labeller='label_both') + scale_x_reverse() + scale_y_reverse() + 
  theme(legend.position = 'none')

Same thing, by speaker:

hull_df_2 %>% ggplot(aes(x=F2_mean, y=F1_mean)) + geom_polygon(aes(color=stressed), fill=NA) + facet_wrap(~speaker) + scale_x_reverse() + scale_y_reverse()