1 Preliminaries

Load packages and data output by the PolyglotDB formants tutorial script:

library(tidyverse)
library(broom)

v <- read_csv("librispeech_pitch_full_set.csv")

1.1 Preprocessing

Save original dataset:

v_orig <- v

Quick look at columns of the dataframe:

glimpse(v)
## Rows: 30,210
## Columns: 15
## $ traj_id                      <chr> "fcb0eac8-fcdb-11ec-b292-fc4dd44cd2b1", "…
## $ vowel                        <chr> "AH1", "AH1", "AH1", "AH1", "AH0", "AH0",…
## $ consonant                    <chr> "B", "B", "B", "B", "DH", "DH", "DH", "DH…
## $ following_phone              <chr> "T", "T", "T", "T", "R", "R", "R", "R", "…
## $ word                         <chr> "but", "but", "but", "but", "the", "the",…
## $ word_duration                <dbl> 0.13, 0.13, 0.13, 0.13, 0.17, 0.17, 0.17,…
## $ word_transcription           <chr> "B.AH1.T", "B.AH1.T", "B.AH1.T", "B.AH1.T…
## $ following_word_transcription <chr> "T.AH0", "T.AH0", "T.AH0", "T.AH0", "R.IH…
## $ begin                        <dbl> 12.69, 12.69, 12.69, 12.69, 5.73, 5.73, 5…
## $ end                          <dbl> 12.73, 12.73, 12.73, 12.73, 5.84, 5.84, 5…
## $ discourse                    <chr> "1089-134691-0013", "1089-134691-0013", "…
## $ speaker                      <dbl> 1089, 1089, 1089, 1089, 1089, 1089, 1089,…
## $ sex                          <chr> "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ time                         <dbl> 12.695, 12.705, 12.715, 12.725, 5.740, 5.…
## $ F0                           <dbl> -1.00, -1.00, -1.00, -1.00, 102.25, 102.1…

We just want data from words beginning with CV. There are some words starting with VV, which is a bug in the query. Exclude these:

v <- filter(v, !(consonant %in% c("AW1", "OW1", "ER0"))) %>% droplevels()

We are primarily interested in:

  • F0: pitch (in Hz), where -1 means “undefined” (unvoiced frame of speech)
  • sex of speaker
  • speaker
  • consonant: consonant preceding the vowel

Define and correct some columns:

  • stress: this is e.g. 1 in AH1
  • stressed: stress = 0 vs 2/1
  • vowel: this is e.g. AH in AH
  • speaker: should be character (or factor)
  • rel_time: time since beginning of vowel
v <- v %>% mutate(
  stress = str_split(vowel, "[A-Z]+", simplify = TRUE)[, 2],
  stressed = ifelse(stress == "0", "N", "Y"),
  vowel = str_split(vowel, "[0-9]", simplify = TRUE)[, 1],
  speaker = as.character(speaker)
)

v <- v %>%
  group_by(traj_id) %>%
  summarize(min_time = min(time)) %>%
  left_join(v) %>%
  mutate(rel_time = time - min_time) %>%
  select(-min_time)

Below we’ll look at effects of consonant type and vowel height on F0. This requires defining:

  1. Broad types of consonant
v <- v %>%
  mutate(
    cons_class =
      fct_recode(
        consonant,
        voiced_obs = "B", voiced_obs = "DH", voiced_obs = "D", voiced_obs = "G", voiced_obs = "V", voiced_obs = "JH", voiced_obs = "Z",
        vless_obs = "CH", vless_obs = "F", vless_obs = "CH", vless_obs = "HH", vless_obs = "K", vless_obs = "P", vless_obs = "S", vless_obs = "SH", vless_obs = "T", vless_obs = "TH",
        nasal = "M", nasal = "N",
        sonorant = "L", sonorant = "R", sonorant = "W", sonorant = "Y"
      )
  )
  1. Vowel height
v <- v %>% 
  mutate(
    vowel_height = 
      fct_recode(
        vowel,
        high = "IY", high = "UH", high = "UW", high = "IH",
        mid = "EY", mid = "EH", mid = "ER", mid = "OW", mid = "AH",
        low = "AE", low = "AA", low = "AO",
        dipth = "OY", dipth = "AY", dipth = "AW"
      )
  )

Make pitch = -1 be NA:

v[v$F0 == -1, "F0"] <- NA

Speaker mean pitch, and \(z\)-score pitch within speaker:

v <- v %>%
  group_by(speaker) %>%
  summarize(F0_mean = mean(F0, na.rm = T)) %>%
  left_join(v) %>%
  group_by(speaker) %>%
  mutate(F0_std = as.numeric(scale(F0)))

Add vowel duration (I think), basic stats to be used to detect trajectories where pitch tracking was off.

v <- v %>%
  group_by(traj_id) %>%
  summarise(
    dur = 0.010 * (n() - 1),
    ## number and % of NA points in this trajectory
    na_n = sum(is.na(F0)),
    not_na_n = n() - na_n,
    na_frac = na_n / n(),
  ) %>%
  left_join(v)

2 Data cleaning

All trajectories, filtering out those with non-trivial undefined F0:

v %>%
  ## filter out trajectories where any appreciable portion of vowel didn't have F0 detected:
  filter(na_frac < 0.1 & na_n <= 2) %>%
  ggplot(aes(x = rel_time, y = F0_std)) +
  geom_line(aes(group = traj_id))
## Warning: Removed 73 row(s) containing missing values (geom_path).

There are some crazy trajectories there. For this demo, let’s just remove anything 3 SD from speaker mean or >200 msec long (which is a very long vowel):

v <- v %>%
  filter(na_frac < 0.1 & na_n <= 2) %>%
  filter(abs(F0_std) < 3) %>%
  filter(dur < 0.2)
v %>%
  filter(na_frac < 0.1 & na_n <= 2) %>%
  ggplot(aes(x = rel_time, y = F0_std)) +
  geom_line(aes(group = traj_id)) +
  facet_wrap(~speaker)

Looks a bit better.

How much data was excluded?

## this percent
1 - nrow(v) / nrow(v_orig)
## [1] 0.3268454

3 Basic patterns

We expect:

  1. Male speakers have higher F0 than female speakers (on average)
  2. F0 should be higher after voiceless vs. voiced obstruents, more so near the beginning of the vowel
  3. F0 should be higher for high vs. low vowels; mid vowels may pattern in between.
  4. Effects (2) and (3) should be larger at higher F0 values, hence larger for F vs. M speakers.

Examine in just stressed syllables, to avoid prosodic effects.

3.1 Consonant voicing

Voiced vs. voiceless obstruent trajectories as a function of % through the vowel:

v %>%
  filter(stressed == "Y" & cons_class %in% c("voiced_obs", "vless_obs")) %>%
  ggplot(aes(x = rel_time / dur, y = F0_std)) +
  geom_smooth(aes(color = cons_class))

Beautiful. In Hz, faceting by gender:

v %>%
  filter(stressed == "Y" & cons_class %in% c("voiced_obs", "vless_obs")) %>%
  ggplot(aes(x = rel_time / dur, y = F0)) +
  geom_smooth(aes(color = cons_class)) +
  facet_wrap(~sex)

The effect is larger (in Hz) for F speakers than for M speakers, as expected.

Now by-speaker:

v %>%
  filter(stressed == "Y" & cons_class %in% c("voiced_obs", "vless_obs")) %>%
  ggplot(aes(x = rel_time / dur, y = F0_std)) +
  geom_smooth(aes(color = cons_class)) +
  facet_wrap(~speaker, scales = "free_y") +
  theme(legend.position = "none")

Here there is some interesting variability, but most speakers show the expected pattern: blue \(>\) red, at least at beginning of the vowel.

3.2 Vowel height

Trajectories as a function of vowel height, just for stressed syllables, omitting diphthongs (which don’t have a well-defined height):

v %>%
  filter(stressed == "Y" & vowel_height != "dipth") %>%
  droplevels() %>%
  ggplot(aes(x = rel_time / dur, y = F0_std)) +
  geom_smooth(aes(color = vowel_height))

Beautiful. In Hz, faceting by gender:

v %>%
  filter(stressed == "Y" & vowel_height != "dipth") %>%
  droplevels() %>%
  ggplot(aes(x = rel_time / dur, y = F0)) +
  geom_smooth(aes(color = vowel_height)) +
  facet_wrap(~sex)

The effect is larger (in Hz) for F speakers than for M speakers, as expected.