import numpy as np
import peakutils as pu
def get_F_0( signal, rate, time_step = 0.0, min_pitch = 75, max_pitch = 600,
max_num_cands = 15, silence_thres = .03, voicing_thres = .45,
octave_cost = .01, octave_jump_cost = .35,
voiced_unvoiced_cost = .14, accurate = False, pulse = False ):
"""
Computes median Fundamental Frequency ( :math:`F_0` ).
.. note::
It has been shown that depressed and suicidal men speak with a reduced
fundamental frequency range ( described in:
http://ameriquests.org/index.php/vurj/article/download/2783/1181 ) and
patients responding well to depression treatment show an increase in
their fundamental frequency variability ( described in:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3022333/ ). Because
acoustical properties of speech are the earliest and most consistent
indicators of mood disorders, early detection of fundamental frequency
changes could significantly improve recovery time for disorders with
psychomotor symptoms.
The fundamental frequency ( :math:`F_0` ) of a signal is the lowest
frequency, or the longest wavelength of a periodic waveform. In the context
of this algorithm, :math:`F_0` is calculated by segmenting a signal into
frames and, for each frame, choosing the most likely candidate for that
frame's :math:`F_0` from among the lowest possible frequencies. From all of
these values, the median value is returned. More specifically, the
algorithm filters out frequencies higher than the Nyquist Frequency from
the signal, then segments the signal into frames of at least 3 periods of
the minimum pitch. For each frame, it then calculates the normalized
autocorrelation ( :math:`r_a` ), or the correlation of the signal to a
delayed copy of itself. :math:`r_a` is calculated according to Boersma's
paper ( referenced below ), which is an improvement of previous methods.
:math:`r_a` is estimated by dividing the autocorrelation of the windowed
signal by the autocorrelation of the window. After :math:`r_a` is
calculated the maxima values of :math:`r_a` are found. These points
correspond to the lag domain, or points in the delayed signal, where the
correlation value has peaked, with higher peaks indicating stronger
correlations. These points in the lag domain suggest places of wave
repetition and are the candidates for :math:`F_0`. The best candidate for
:math:`F_0` of each frame is picked by a cost function that compares the
cost of transitioning from the best :math:`F_0` of the previous frame to
all possible :math:`F_0` values of the current frame. Once the least-cost
path of :math:`F_0` values has been determined, the median :math:`F_0` of
all voiced frames is returned.
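As a rough, self-contained sketch of the candidate-extraction step ( the
frame length, FFT size, and lag bounds below are simplifications chosen for
illustration, not this function's exact parameters ):
>>> import numpy as np
>>> rate, min_pitch, max_pitch = 50000, 75, 600
>>> t = np.arange( int( 3.0 / min_pitch * rate ) ) / float( rate )
>>> frame = np.sin( 2 * np.pi * 140 * t )
>>> window = np.hanning( len( frame ) )
>>> n_fft = 2 ** int( np.log2( 1.5 * len( frame ) ) + 1 )
>>> def autocorr( x ):
...     #autocorrelation of x via the power spectrum ( Wiener-Khinchin )
...     X = np.fft.fft( x, n_fft )
...     return np.real( np.fft.ifft( X * np.conjugate( X ) ) )
>>> #Boersma's normalized autocorrelation: divide the autocorrelation of
>>> #the windowed frame by the autocorrelation of the window
>>> r_x = autocorr( frame * window ) / autocorr( window )
>>> r_x /= r_x[ 0 ]
>>> lags = np.arange( int( rate / max_pitch ), int( rate / min_pitch ) )
>>> best_lag = lags[ np.argmax( r_x[ lags ] ) ]
>>> candidate = rate / float( best_lag ) #~140 Hz for this pure tone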
This algorithm is adapted from:
http://www.fon.hum.uva.nl/david/ba_shs/2010/Boersma_Proceedings_1993.pdf
and from:
https://github.com/praat/praat/blob/master/fon/Sound_to_Pitch.cpp
Args:
signal ( numpy.ndarray ): The signal :math:`F_0` will be calculated from.
rate ( int ): the number of samples per second that the signal was sampled at.
time_step ( float ): ( optional, default value: 0.0 ) the measurement, in seconds, of time passing between each frame. The smaller the time_step, the more overlap that will occur. If 0 is supplied the degree of oversampling will be equal to four.
min_pitch ( float ): ( optional, default value: 75 ) minimum value to be returned as pitch, cannot be less than or equal to zero.
max_pitch ( float ): ( optional, default value: 600 ) maximum value to be returned as pitch, cannot be greater than the Nyquist Frequency.
max_num_cands ( int ): ( optional, default value: 15 ) maximum number of candidates to be considered for each frame, unvoiced candidate ( i.e. :math:`F_0` equal to zero ) is always considered.
silence_thres ( float ): ( optional, default value: 0.03 ) frames that do not contain amplitudes above this threshold ( relative to the global maximum amplitude ), are probably silent.
voicing_thres ( float ): ( optional, default value: 0.45 ) the strength of the unvoiced candidate, relative to the maximum possible autocorrelation. To increase the number of unvoiced decisions, increase this value.
octave_cost ( float ): ( optional, default value: 0.01 per octave ) degree of favouring of high-frequency candidates, relative to the maximum possible :math:`r_a`. This is necessary because in the case of a perfectly periodic signal, all undertones of :math:`F_0` are equally strong candidates as :math:`F_0` itself. To more strongly favour recruitment of high-frequency candidates, increase this value.
octave_jump_cost ( float ): ( optional, default value: 0.35 ) degree of disfavouring of pitch changes, relative to the maximum possible :math:`r_a`. To decrease the number of large frequency jumps, increase this value.
voiced_unvoiced_cost ( float ): ( optional, default value: 0.14 ) degree of disfavouring of voiced/unvoiced transitions, relative to the maximum possible :math:`r_a`. To decrease the number of voiced/unvoiced transitions, increase this value.
accurate ( bool ): ( optional, default value: False ) if False, the window is a Hanning window with a length of :math:`\\frac{3.0}{min\_pitch}`. If True, the window is a Gaussian window with a length of :math:`\\frac{6.0}{min\_pitch}`, i.e. twice the length.
pulse ( bool ): ( optional, default value: False ) if False, returns a list containing only the median :math:`F_0`; if True, returns a list with all values necessary to calculate pulses. This list contains the median :math:`F_0`, a list of the periods for each voiced frame, a list of tuples containing the beginning and ending times of each frame, and the signal filtered by the Nyquist Frequency. The indices in the second and third lists correspond to each other.
Returns:
list: index 0 contains the median :math:`F_0` in Hz. If pulse is set
equal to True, indices 1, 2, and 3 will contain:
a list of all voiced periods in order,\n
a list of tuples of the beginning and ending time of a voiced
interval, each index in the list corresponding to the previous
list, and \n
a numpy.ndarray of the signal filtered by the Nyquist Frequency.
If pulse is set equal to False, or left to the default value, then the
list will only contain the median :math:`F_0`.
Raises:
ValueError: min_pitch has to be greater than zero.
ValueError: octave_cost isn't in [ 0, 1 ].
ValueError: silence_thres isn't in [ 0, 1 ].
ValueError: voicing_thres isn't in [ 0, 1 ].
ValueError: max_pitch can't be larger than Nyquist Frequency.
Example:
The example below demonstrates the different outputs this function gives,
using a synthesized signal.
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> domain = np.linspace( 0, 6, 300000 )
>>> rate = 50000
>>> y = lambda x: np.sin( 2 * np.pi * 140 * x )
>>> signal = y( domain )
>>> get_F_0( signal, rate )
[ 139.70588235294116 ]
>>> get_F_0( signal, rate, voicing_thres = .99, accurate = True )
[ 139.70588235294116 ]
>>> w, x, y, z = get_F_0( signal, rate, pulse = True )
>>> print( w )
139.70588235294116
>>> print( x[ :5 ] )
[ 0.00715789 0.00715789 0.00715789 0.00715789 0.00715789 ]
>>> print( y[ :5 ] )
[ ( 0.002500008333361111, 0.037500125000416669 ),
( 0.012500041666805555, 0.047500158333861113 ),
( 0.022500075000249999, 0.057500191667305557 ),
( 0.032500108333694447, 0.067500225000749994 ),
( 0.042500141667138891, 0.077500258334194452 ) ]
>>> print( z[ : 5 ] )
[ 0. 0.01759207 0.0351787 0.05275443 0.07031384 ]
The example below demonstrates the algorithm's ability to adjust for signals
with dynamic frequencies, by comparing a plot of a synthesized signal with
an increasing frequency and the calculated frequencies for that signal.
>>> domain = np.linspace( 1, 2, 10000 )
>>> rate = 10000
>>> y = lambda x : np.sin( x ** 8 )
>>> signal = y( domain )
>>> median_F_0, periods, time_vals, modified_sig = get_F_0( signal,
rate, pulse = True )
>>> plt.subplot( 211 )
>>> plt.plot( domain, signal )
>>> plt.title( "Synthesized Signal" )
>>> plt.ylabel( "Amplitude" )
>>> plt.subplot( 212 )
>>> plt.plot( np.linspace( 1, 2, len( periods ) ), 1.0 / np.array(
periods ) )
>>> plt.title( "Frequencies of Signal" )
>>> plt.xlabel( "Samples" )
>>> plt.ylabel( "Frequency" )
>>> plt.suptitle( "Comparison of Synthesized Signal and its Calculated Frequencies" )
>>> plt.show()
.. figure:: figures/F_0_synthesized_sig.png
:align: center
"""
if min_pitch <= 0:
raise ValueError( "min_pitch has to be greater than zero." )
if max_num_cands < max_pitch / min_pitch:
max_num_cands = int( max_pitch / min_pitch )
initial_len = len( signal )
total_time = initial_len / float( rate )
tot_time_arr = np.linspace( 0, total_time, initial_len )
max_place_poss = 1.0 / min_pitch
min_place_poss = 1.0 / max_pitch
#to silence formants
min_place_poss2 = 0.5 / max_pitch
if accurate: pds_per_window = 6.0
else: pds_per_window = 3.0
#degree of oversampling is 4
if time_step <= 0: time_step = ( pds_per_window / 4.0 ) / min_pitch
w_len = pds_per_window / min_pitch
#correcting for time_step
octave_jump_cost *= .01 / time_step
voiced_unvoiced_cost *= .01 / time_step
Nyquist_Frequency = rate / 2.0
upper_bound = .95 * Nyquist_Frequency
zeros_pad = 2 ** ( int( np.log2( initial_len ) ) + 1 ) - initial_len
signal = np.hstack( ( signal, np.zeros( zeros_pad ) ) )
fft_signal = np.fft.fft( signal )
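#zero the upper FFT bins as a crude low-pass filter ( note: the slice is
#in bin indices, not Hz )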
fft_signal[ int( upper_bound ) : -int( upper_bound ) ] = 0
sig = np.fft.ifft( fft_signal )
sig = sig[ :initial_len ].real
#checking to make sure values are valid
if Nyquist_Frequency < max_pitch:
raise ValueError( "max_pitch can't be larger than Nyquist Frequency." )
if octave_cost < 0 or octave_cost > 1:
raise ValueError( "octave_cost isn't in [ 0, 1 ]." )
if voicing_thres < 0 or voicing_thres > 1:
raise ValueError( "voicing_thres isn't in [ 0, 1 ]." )
if silence_thres < 0 or silence_thres > 1:
raise ValueError( "silence_thres isn't in [ 0, 1 ]." )
#finding number of samples per frame and time_step
frame_len = int( w_len * rate + .5 )
time_len = int( time_step * rate + .5 )
#initializing list of candidates for F_0, and their strengths
best_cands, strengths, time_vals = [], [], []
#finding the global peak the way Praat does
global_peak = max( abs( sig - sig.mean() ) )
e = np.e
inf = np.inf
log = np.log2
start_i = 0
while start_i < len( sig ) - frame_len :
end_i = start_i + frame_len
segment = sig[ start_i : end_i ]
if accurate:
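#Gaussian window, as defined in Boersma's paper ( referenced in the
#docstring above )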
t = np.linspace( 0, w_len, len( segment ) )
numerator = e ** ( -12.0 * ( t / w_len - .5 ) ** 2.0 ) - e ** -12.0
denominator = 1.0 - e ** -12.0
window = numerator / denominator
interpolation_depth = 0.25
else:
window = np.hanning( len( segment ) )
interpolation_depth = 0.50
#shave off ends of time intervals to account for overlapping
start_time = tot_time_arr[ start_i + int( time_len / 4.0 ) ]
stop_time = tot_time_arr[ end_i - int( time_len / 4.0 ) ]
time_vals.append( ( start_time, stop_time ) )
start_i += time_len
long_pd_i = int( rate / min_pitch )
half_pd_i = int( long_pd_i / 2.0 + 1 )
long_pd_cushion = segment[ half_pd_i : - half_pd_i ]
#finding local peak and local mean the way Praat does
#the local mean is found by looking at the longest period to either side
#of the center of the frame, and using only the values within this
#interval to calculate the local mean; similarly, the local peak is found
#by looking at half of the longest period to either side of the center of
#the frame ( after the frame has been windowed ) and choosing the absolute
#maximum in this interval
local_mean = long_pd_cushion.mean()
segment = segment - local_mean
segment *= window
half_pd_cushion = segment[ long_pd_i : -long_pd_i ]
local_peak = max( abs( half_pd_cushion ) )
if local_peak == 0:
#shortcut -> complete silence and only candidate is silent candidate
best_cands.append( [ inf ] )
strengths.append( [ voicing_thres + 2 ] )
else:
#calculating autocorrelation, based on steps 3.2-3.10
intensity = local_peak / float( global_peak )
N = len( segment )
nFFT = 2 ** int( log( ( 1.0 + interpolation_depth ) * N ) + 1 )
window = np.hstack( ( window, np.zeros( nFFT - N ) ) )
segment = np.hstack( ( segment, np.zeros( nFFT - N ) ) )
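#autocorrelation via the power spectrum ( Wiener-Khinchin ); using fft
#instead of ifft only rescales the result, which the division by r_w and
#the normalization below cancel out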
x_fft = np.fft.fft( segment )
r_a = np.real( np.fft.fft( x_fft * np.conjugate( x_fft ) ) )
r_a = r_a[ : int( N / pds_per_window ) ]
x_fft = np.fft.fft( window )
r_w = np.real( np.fft.fft( x_fft * np.conjugate( x_fft ) ) )
r_w = r_w[ : int( N / pds_per_window ) ]
r_x = r_a / r_w
r_x /= r_x[ 0 ]
#creating an array of the points in time corresponding to sampled
#autocorrelation of the signal ( r_x )
time_array = np.linspace( 0, w_len / pds_per_window, len( r_x ) )
peaks = pu.indexes( r_x, thres = 0 )
max_values, max_places = r_x[ peaks ], time_array[ peaks ]
#only consider places that are voiced over a certain threshold
max_places = max_places[ max_values > 0.5 * voicing_thres ]
max_values = max_values[ max_values > 0.5 * voicing_thres ]
for i in range( len( max_values ) ):
#reflecting values > 1 through 1.
if max_values[ i ] > 1.0 :
max_values[ i ] = 1.0 / max_values[ i ]
#calculating the relative strength value
rel_val = [ val - octave_cost * log( place * min_pitch ) for
val, place in zip( max_values, max_places ) ]
if len( max_values ) > 0:
#finding the max_num_cands-1 maximizers, and maximums, then
#calculating their strengths ( eq. 23 and 24 ) and accounting for
#silent candidate
max_places = [ max_places[ i ] for i in np.argsort( rel_val )[
-max_num_cands + 1 : ] ]
max_values = [ max_values[ i ] for i in np.argsort( rel_val )[
-max_num_cands + 1 : ] ]
max_places = np.array( max_places )
max_values = np.array( max_values )
rel_val = list(np.sort( rel_val )[ -max_num_cands + 1 : ] )
#adding the silent candidate's strength to strengths
rel_val.append( voicing_thres + max( 0, 2 - ( intensity /
( silence_thres / ( 1 + voicing_thres ) ) ) ) )
#inf is our silent candidate
max_places = np.hstack( ( max_places, inf ) )
best_cands.append( list( max_places ) )
strengths.append( rel_val )
else:
#if there are no available maximums, only account for silent
#candidate
best_cands.append( [ inf ] )
strengths.append( [ voicing_thres + max( 0, 2 - intensity /
( silence_thres / ( 1 + voicing_thres ) ) ) ] )
#Calculate the least-cost path through the list of candidates ( forwards )
#and return the path.
best_total_cost, best_total_path = -inf, []
#for each initial candidate find the path of least cost, then of those
#paths, choose the one with the least cost.
for cand in range( len( best_cands[ 0 ] ) ):
start_val = best_cands[ 0 ][ cand ]
total_path = [ start_val ]
level = 1
prev_delta = strengths[ 0 ][ cand ]
maximum = -inf
while level < len( best_cands ) :
prev_val = total_path[ -1 ]
best_val = inf
for j in range( len( best_cands[ level ] ) ):
cur_val = best_cands[ level ][ j ]
cur_delta = strengths[ level ][ j ]
cost = 0
cur_unvoiced = cur_val == inf or cur_val < min_place_poss2
prev_unvoiced = prev_val == inf or prev_val < min_place_poss2
if cur_unvoiced:
#both voiceless
if prev_unvoiced:
cost = 0
#voiced-to-unvoiced transition
else:
cost = voiced_unvoiced_cost
else:
#unvoiced-to-voiced transition
if prev_unvoiced:
cost = voiced_unvoiced_cost
#both are voiced
else:
cost = octave_jump_cost * abs( log( cur_val /
prev_val ) )
#The cost for any given candidate is given by the transition
#cost, minus the strength of the given candidate
value = prev_delta - cost + cur_delta
if value > maximum: maximum, best_val = value, cur_val
prev_delta = maximum
total_path.append( best_val )
level += 1
if maximum > best_total_cost:
best_total_cost, best_total_path = maximum, total_path
f_0_forth = np.array( best_total_path )
#Calculate the least-cost path through the list of candidates ( backwards )
#and return the path. Traversing the candidates backwards can introduce
#frequencies previously marked as unvoiced, or favour undertones, in order
#to decrease frequency jumps
best_total_cost, best_total_path2 = -inf, []
#Starting at the end, for each initial candidate find the path of least
#cost, then of those paths, choose the one with the least cost.
for cand in range( len( best_cands[ -1 ] ) ):
start_val = best_cands[ -1 ][ cand ]
total_path = [ start_val ]
level = len( best_cands ) - 2
prev_delta = strengths[ -1 ][ cand ]
maximum = -inf
while level > -1 :
prev_val = total_path[ -1 ]
best_val = inf
for j in range( len( best_cands[ level ] ) ):
cur_val = best_cands[ level ][ j ]
cur_delta = strengths[ level ][ j ]
cost = 0
cur_unvoiced = cur_val == inf or cur_val < min_place_poss2
prev_unvoiced = prev_val == inf or prev_val < min_place_poss2
if cur_unvoiced:
#both voiceless
if prev_unvoiced:
cost = 0
#voiced-to-unvoiced transition
else:
cost = voiced_unvoiced_cost
else:
#unvoiced-to-voiced transition
if prev_unvoiced:
cost = voiced_unvoiced_cost
#both are voiced
else:
cost = octave_jump_cost * abs( log( cur_val /
prev_val ) )
#The cost for any given candidate is given by the transition
#cost, minus the strength of the given candidate
value = prev_delta - cost + cur_delta
if value > maximum: maximum, best_val = value, cur_val
prev_delta = maximum
total_path.append( best_val )
level -= 1
if maximum > best_total_cost:
best_total_cost, best_total_path2 = maximum, total_path
f_0_back = np.array( best_total_path2 )
#reversing f_0_back so the initial value corresponds to the first frequency
f_0_back = f_0_back[ -1 : : -1 ]
#the path values are lag times ( periods ), so taking the minimum of each
#pair selects the higher frequency for the total path
f_0 = np.array( [ min( i, j ) for i, j in zip( f_0_forth, f_0_back ) ] )
if pulse:
#removing all unvoiced time intervals from list
removed = 0
for i in range( len( f_0 ) ):
if f_0[ i ] > max_place_poss or f_0[ i] < min_place_poss:
time_vals.remove( time_vals[ i - removed ] )
removed += 1
for i in range( len( f_0 ) ):
#if the frame is voiceless, mark the occurrence of the peak as inf; these
#entries are filtered out below, leaving only voiced periods
if f_0[ i ] > max_place_poss or f_0[ i ] < min_place_poss :
f_0[ i ] = inf
f_0 = f_0[ f_0 < inf ]
if pulse:
return [ np.median( 1.0 / f_0 ), list( f_0 ), time_vals, signal ]
if len( f_0 ) == 0:
return [ 0 ]
else:
return [ np.median( 1.0 / f_0 ) ]
def get_HNR( signal, rate, time_step = 0, min_pitch = 75,
silence_threshold = .1, periods_per_window = 4.5 ):
"""
Computes mean Harmonics-to-Noise ratio ( HNR ).
.. note::
The Harmonics-to-Noise ratio of a person's voice is strongly negatively
correlated to depression severity ( described in:
https://ll.mit.edu/mission/cybersec/publications/publication-files/full_papers/2012_09_09_MalyskaN_Interspeech_FP.pdf )
and can be used as an early indicator of depression and suicide risk.
After this indicator has been realized, preventative medicine can be
implemented, improving recovery time or even preventing further
symptoms.
The Harmonics-to-Noise ratio ( HNR ) is the ratio
of the energy of a periodic signal to the energy of the noise in the
signal, expressed in dB, and often used as a measure of hoarseness in a
person's voice. By way of illustration, if 99% of the energy of the signal
is in the periodic part and 1% of the energy is in noise, then the HNR is
:math:`10 \cdot log_{10}( \\frac{99}{1} ) = 20`. A HNR of 0 dB means that
there is equal energy in the harmonics and in the noise. The first step for
HNR determination for a signal, in the context of this algorithm, is to
set the maximum frequency to the signal's Nyquist Frequency. Then the
signal is segmented into frames of length
:math:`\\frac{periods\_per\_window}{min\_pitch}`. For each frame it then
calculates the normalized autocorrelation ( :math:`r_a` ), or the
correlation of the signal to a delayed copy of itself. :math:`r_a` is
calculated according to Boersma's paper ( referenced below ). The highest
peak is picked from :math:`r_a`. If the height of this peak is larger than
the strength of the silent candidate, then the HNR for this frame is
calculated from that peak. The height of the peak corresponds to the energy
of the periodic part of the signal. Once the HNR value has been calculated
for all voiced frames, the mean is taken from these values and returned.
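As a minimal sketch of the final conversion ( eq. 4 in Boersma's paper ),
the height r_max of the chosen autocorrelation peak maps to dB as follows;
the 0.99 below is only an illustrative value matching the 99% example above:
>>> import numpy as np
>>> r_max = 0.99 #height of the highest autocorrelation peak
>>> HNR = 10.0 * np.log10( r_max / ( 1.0 - r_max ) ) #~20 dB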
This algorithm is adapted from:
http://www.fon.hum.uva.nl/david/ba_shs/2010/Boersma_Proceedings_1993.pdf
and from:
https://github.com/praat/praat/blob/master/fon/Sound_to_Harmonicity.cpp
Args:
signal ( numpy.ndarray ): The signal the HNR will be calculated from.
rate ( int ): the number of samples per second that the signal was sampled at.
time_step ( float ): ( optional, default value: 0.0 ) the measurement, in seconds, of time passing between each frame. The smaller the time_step, the more overlap that will occur. If 0 is supplied the degree of oversampling will be equal to four.
min_pitch ( float ): ( optional, default value: 75 ) minimum value to be returned as pitch, cannot be less than or equal to zero.
silence_threshold ( float ): ( optional, default value: 0.1 ) frames that do not contain amplitudes above this threshold ( relative to the global maximum amplitude ), are considered silent.
periods_per_window ( float ): ( optional, default value: 4.5 ) 4.5 is best for speech. The more periods contained per frame the more the algorithm becomes sensitive to dynamic changes in the signal.
Returns:
float: The mean HNR of the signal expressed in dB.
Raises:
ValueError: min_pitch has to be greater than zero.
ValueError: silence_threshold isn't in [ 0, 1 ].
Example:
The example below adjusts parameters of the function, using the same
synthesized signal with added noise to demonstrate the stability of the
function.
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> domain = np.linspace( 0, 6, 300000 )
>>> rate = 50000
>>> y = lambda x:( 1 + .3 * np.sin( 2 * np.pi * 140 * x ) ) * np.sin(
2 * np.pi * 140 * x )
>>> signal = y( domain ) + .2 * np.random.random( 300000 )
>>> get_HNR( signal, rate )
21.885338007330802
>>> get_HNR( signal, rate, periods_per_window = 6 )
21.866307805597849
>>> get_HNR( signal, rate, time_step = .04, periods_per_window = 6 )
21.878451649148804
We'd expect that an increase in noise would reduce the HNR, and that when the
energy of the noise and the energy of the signal are similar, the HNR
approaches zero, as demonstrated below.
>>> signals = [ y( domain ) + i / 10.0 * np.random.random( 300000 ) for
i in range( 1, 11 ) ]
>>> HNRx10 = [ get_HNR( sig, rate ) for sig in signals ]
>>> plt.plot( np.linspace( .1, 1, 10 ), HNRx10 )
>>> plt.xlabel( "Amount of Added Noise" )
>>> plt.ylabel( "HNR" )
>>> plt.title( "HNR Values of Signals with Added Noise" )
>>> plt.show()
.. figure:: figures/HNR_values_added_noise.png
:align: center
"""
#checking to make sure values are valid
if min_pitch <= 0:
raise ValueError( "min_pitch has to be greater than zero." )
if silence_threshold < 0 or silence_threshold > 1:
raise ValueError( "silence_threshold isn't in [ 0, 1 ]." )
#degree of oversampling is four
if time_step <= 0: time_step = ( periods_per_window / 4.0 ) / min_pitch
Nyquist_Frequency = rate / 2.0
max_pitch = Nyquist_Frequency
global_peak = max( abs( signal - signal.mean() ) )
window_len = periods_per_window / float( min_pitch )
#finding number of samples per frame and time_step
frame_len = int( window_len * rate )
t_len = int( time_step * rate )
#segmenting signal, there has to be at least one frame
num_frames = max( 1, int( len( signal ) / t_len + .5 ) )
seg_signal = [ signal[ int( i * t_len ) : int( i * t_len ) + frame_len ]
for i in range( num_frames + 1 ) ]
#initializing list of candidates for HNR
best_cands = []
for index in range( len( seg_signal ) ):
segment = seg_signal[ index ]
#ignoring any potential empty segment
if len( segment ) > 0:
window_len = len( segment ) / float( rate )
#calculating autocorrelation, based on steps 3.2-3.10
segment = segment - segment.mean()
local_peak = max( abs( segment ) )
if local_peak == 0 :
best_cands.append( .5 )
else:
intensity = local_peak / global_peak
window = np.hanning( len( segment ) )
segment *= window
N = len( segment )
nsampFFT = 2 ** int( np.log2( N ) + 1 )
window = np.hstack( ( window, np.zeros( nsampFFT - N ) ) )
segment = np.hstack( ( segment, np.zeros( nsampFFT - N ) ) )
x_fft = np.fft.fft( segment )
r_a = np.real( np.fft.fft( x_fft * np.conjugate( x_fft ) ) )
r_a = r_a[ : N ]
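#defensively replace any NaNs / infs before the normalization below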
r_a = np.nan_to_num( r_a )
x_fft = np.fft.fft( window )
r_w = np.real( np.fft.fft( x_fft * np.conjugate( x_fft ) ) )
r_w = r_w[ : N ]
r_w = np.nan_to_num( r_w )
r_x = r_a / r_w
r_x /= r_x[ 0 ]
#creating an array of the points in time corresponding to the
#sampled autocorrelation of the signal ( r_x )
time_array = np.linspace( 0, window_len, len( r_x ) )
i = pu.indexes( r_x )
max_values, max_places = r_x[ i ], time_array[ i ]
max_place_poss = 1.0 / min_pitch
min_place_poss = 1.0 / max_pitch
max_values = max_values[ max_places >= min_place_poss ]
max_places = max_places[ max_places >= min_place_poss ]
max_values = max_values[ max_places <= max_place_poss ]
max_places = max_places[ max_places <= max_place_poss ]
for i in range( len( max_values ) ):
#reflecting values > 1 through 1.
if max_values[ i ] > 1.0 :
max_values[ i ] = 1.0 / max_values[ i ]
#eq. 23 and 24 with octave_cost and voicing_threshold set to zero
if len( max_values ) > 0:
strengths = [ max( max_values ), max( 0, 2 - ( intensity /
( silence_threshold ) ) ) ]
#if the maximum strength is the unvoiced candidate, then .5
#corresponds to HNR of 0
if np.argmax( strengths ):
best_cands.append( 0.5 )
else:
best_cands.append( strengths[ 0 ] )
else:
best_cands.append( 0.5 )
best_cands = np.array( best_cands )
best_cands = best_cands[ best_cands > 0.5 ]
if len( best_cands ) == 0:
return 0
#eq. 4
best_cands = 10.0 * np.log10( best_cands / ( 1.0 - best_cands ) )
best_candidate = np.mean( best_cands )
return best_candidate
def get_Pulses( signal, rate, min_pitch = 75, max_pitch = 600,
include_max = False, include_min = True ):
"""
Computes glottal pulses of a signal.
.. note::
This algorithm is a helper function for the jitter algorithm that
returns a list of points in the time domain corresponding to minima or
maxima of the signal. These minima or maxima are the sequence of
glottal closures in vocal-fold vibration. The distance between
consecutive pulses is defined as the wavelength of the signal at this
interval, which can be used to later calculate jitter.
This algorithm relies on the voiced/unvoiced decisions and fundamental
frequencies calculated for each voiced frame by get_F_0. For every voiced
interval, a list of points is created by finding the initial point
:math:`t_1`, which is the absolute extremum ( or the maximum/minimum,
depending on the include_max and include_min parameters ) of the amplitude
of the sound in the interval
:math:`[\ t_{mid} - \\frac{T_0}{2},\ t_{mid} + \\frac{T_0}{2}\ ]`, where
:math:`t_{mid}` is the midpoint of the interval, and :math:`T_0` is the
period at :math:`t_{mid}`, as can be linearly interpolated from the periods
acquired from get_F_0. From this point, the algorithm searches for points
:math:`t_i` to the left until it reaches the left edge of the interval. These
points are the absolute extrema ( or the maxima/minima ) in the interval
:math:`[\ t_{i-1} - 1.25 \cdot T_{i-1},\ t_{i-1} - 0.8 \cdot T_{i-1}\ ]`,
with :math:`t_{i-1}` being the last found point, and :math:`T_{i-1}` the
period at this point. The same is done to the right of :math:`t_1`. The
points are returned in consecutive order.
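The toy sketch below applies just this leftward search rule to a pure tone
with a known constant period T_0; the real function instead interpolates a
period for every voiced time point from get_F_0's output:
>>> import numpy as np
>>> rate = 50000
>>> time = np.arange( int( .05 * rate ) ) / float( rate )
>>> signal = np.sin( 2 * np.pi * 140 * time )
>>> T_0 = 1.0 / 140.0
>>> #initial point t_1: the minimum within half a period of the midpoint
>>> mid = np.argmin( abs( time - .025 ) )
>>> lo = np.argmin( abs( time - ( time[ mid ] - T_0 / 2 ) ) )
>>> hi = np.argmin( abs( time - ( time[ mid ] + T_0 / 2 ) ) )
>>> t = time[ lo + np.argmin( signal[ lo : hi ] ) ]
>>> points = [ t ]
>>> #search leftward, one period at a time, in [ t - 1.25 T_0, t - 0.8 T_0 ]
>>> while t - 1.25 * T_0 > time[ 0 ]:
...     lo = np.argmin( abs( time - ( t - 1.25 * T_0 ) ) )
...     hi = np.argmin( abs( time - ( t - 0.80 * T_0 ) ) )
...     t = time[ lo + np.argmin( signal[ lo : hi ] ) ]
...     points.append( t )
>>> points = sorted( points ) #minima spaced ~T_0 apart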
This algorithm is adapted from:
https://pdfs.semanticscholar.org/16d5/980ba1cf168d5782379692517250e80f0082.pdf
and from:
https://github.com/praat/praat/blob/master/fon/Sound_to_PointProcess.cpp
Args:
signal ( numpy.ndarray ): The signal the glottal pulses will be calculated from.
rate ( int ): the number of samples per second that the signal was sampled at.
min_pitch ( float ): ( optional, default value: 75 ) minimum value to be returned as pitch, cannot be less than or equal to zero.
max_pitch ( float ): ( optional, default value: 600 ) maximum value to be returned as pitch, cannot be greater than the Nyquist Frequency.
include_max ( bool ): ( optional, default value: False ) determines if maxima values will be used when calculating pulses
include_min ( bool ): ( optional, default value: True ) determines if minima values will be used when calculating pulses
Returns:
numpy.ndarray: an array of points in a time series that correspond to
the signal periodicity
Raises:
ValueError: include_min and include_max can't both be False
Example:
Pulses are calculated for a synthesized signal, and the variation in time
between consecutive pulses is shown.
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> domain = np.linspace( 0, 6, 300000 )
>>> y = lambda x:( 1 + .3 * np.sin( 2 * np.pi * 140 * x ) ) * np.sin(
2 * np.pi * 140 * x )
>>> signal = y( domain ) + .2 * np.random.random( 300000 )
>>> rate = 50000
>>> p = get_Pulses( signal, rate )
>>> print( p[ :5 ] )
[ 0.00542001 0.01236002 0.01946004 0.02702005 0.03402006 ]
>>> print( np.diff( p[ :6 ] ) )
[ 0.00694001 0.00710001 0.00756001 0.00700001 0.00712001 ]
>>> p = get_Pulses( signal, rate, include_max = True )
>>> print( p[ :5 ] )
[ 0.00886002 0.01608003 0.02340004 0.03038006 0.03732007 ]
>>> print( np.diff( p[ :6 ] ) )
[ 0.00722001 0.00732001 0.00698001 0.00694001 0.00734001 ]
A synthesized signal with an increasing frequency and the calculated pulses
of that signal are plotted together to demonstrate the algorithm's ability
to adapt to dynamic pulses.
>>> domain = np.linspace( 1.85, 2.05, 10000 )
>>> rate = 50000
>>> y = lambda x : np.sin( x ** 8 )
>>> signal = np.hstack( ( np.zeros( 2500 ), y( domain[ 2500: -2500 ] ),
np.zeros( 2500 ) ) )
>>> pulses = get_Pulses( signal, rate )
>>> plt.plot( domain, signal, 'r', alpha = .5, label = "Signal" )
>>> plt.plot( ( 1.85 + pulses[ 0 ] ) * np.ones ( 5 ),
np.linspace( -1, 1, 5 ), 'b', alpha = .5, label = "Pulses" )
>>> plt.legend()
>>> for pulse in pulses[ 1: ]:
...     plt.plot( ( 1.85 + pulse ) * np.ones ( 5 ),
...         np.linspace( -1, 1, 5 ), 'b', alpha = .5 )
>>> plt.xlabel( "Samples" )
>>> plt.ylabel( "Amplitude" )
>>> plt.title( "Signal with Pulses, Calculated from Minima of Signal" )
>>> plt.show()
.. figure:: figures/Pulses_sig.png
:align: center
"""
#first calculate F_0 estimates for each voiced interval
add = np.hstack
if not include_max and not include_min:
raise ValueError( "include_min and include_max can't both be False" )
median, period, intervals, signal = get_F_0( signal, rate,
min_pitch = min_pitch,
max_pitch = max_pitch,
pulse = True )
global_peak = max( abs( signal - signal.mean() ) )
#points will be a list of points where pulses occur, voiced_intervals will
#be a list of tuples consisting of voiced intervals with overlap
#eliminated
points, voiced_intervals = [], []
#f_times will be an array of times corresponding to our given frequencies,
#to be used for interpolating, v_time be an array consisting of all the
#points in time that are voiced
f_times, v_time = np.array( [] ), np.array( [] )
total_time = np.linspace( 0, len( signal ) / float( rate ), len( signal ) )
for interval in intervals:
start, stop = interval
#finding all midpoints for each interval
f_times = add( ( f_times, ( start + stop ) / 2.0 ) )
i = 0
while i < len( intervals ) - 1 :
start, stop = intervals[ i ]
i_start, prev_stop = intervals[ i ]
#while there is overlap, look to the next interval
while start <= prev_stop and i < len( intervals ) - 1 :
prev_start, prev_stop = intervals[ i ]
i += 1
start, stop = intervals[ i ]
if i == len( intervals ) - 1:
samp = int ( ( stop - i_start ) * rate )
v_time = add( ( v_time, np.linspace( i_start, stop, samp ) ) )
voiced_intervals.append( ( i_start, stop ) )
else:
samp = int ( ( prev_stop - i_start ) * rate )
v_time = add( ( v_time, np.linspace( i_start, prev_stop, samp ) ) )
voiced_intervals.append( ( i_start, prev_stop ) )
#interpolate the periods so that each voiced point has a corresponding
#period attached to it
periods_interp = np.interp( v_time, f_times, period )
for interval in voiced_intervals:
start, stop = interval
midpoint = ( start + stop ) / 2.0
#out of all the voiced points, look for index of the one that is
#closest to our calculated midpoint
midpoint_index = np.argmin( abs( v_time - midpoint ) )
midpoint = v_time[ midpoint_index ]
T_0 = periods_interp[ midpoint_index ]
frame_start = midpoint - T_0
frame_stop = midpoint + T_0
#finding points, start by looking to the left of the center of the
#voiced interval
while frame_start > start :
#out of all given time points in signal, find index of closest to
#start and stop
frame_start_index = np.argmin( abs( total_time - frame_start ) )
frame_stop_index = np.argmin( abs( total_time - frame_stop ) )
frame = signal[ frame_start_index : frame_stop_index ]
if include_max and include_min:
p_index = np.argmax( abs( frame ) ) + frame_start_index
elif include_max:
p_index = np.argmax( frame ) + frame_start_index
else:
p_index = np.argmin( frame ) + frame_start_index
if abs( signal[ p_index ] ) > .02333 * global_peak:
points.append( total_time[ p_index ] )
t = total_time[ p_index ]
t_index = np.argmin( abs( v_time - t ) )
T_0 = periods_interp[ t_index ]
frame_start = t - 1.25 * T_0
frame_stop = t - 0.80 * T_0
T_0 = periods_interp[ midpoint_index ]
frame_start = midpoint - T_0
frame_stop = midpoint + T_0
#finding points by now looking to the right of the center of the
#voiced interval
while frame_stop < stop :
#out of all given time points in signal, find index of closest to
#start and stop
frame_start_index = np.argmin( abs( total_time - frame_start ) )
frame_stop_index = np.argmin( abs( total_time - frame_stop ) )
frame = signal[ frame_start_index : frame_stop_index ]
if include_max and include_min:
p_index = np.argmax( abs( frame ) ) + frame_start_index
elif include_max:
p_index = np.argmax( frame ) + frame_start_index
else:
p_index = np.argmin( frame ) + frame_start_index
if abs( signal[ p_index ] ) > .02333 * global_peak:
points.append( total_time[ p_index ] )
t = total_time[ p_index ]
t_index = np.argmin( abs( v_time - t ) )
T_0 = periods_interp[ t_index ]
frame_start = t + 0.80 * T_0
frame_stop = t + 1.25 * T_0
#returning an ordered array of points with any duplicates removed
return np.array( sorted( list( set( points ) ) ) )
def get_Jitter( signal, rate, period_floor = .0001, period_ceiling = .02,
max_period_factor = 1.3 ):
"""
Compute Jitter.
.. note::
Significant differences can occur in jitter and shimmer measurements
between different speaking styles; these differences make it possible to
use jitter as a feature for speaker recognition ( referenced below ).
Jitter is the measurement of random perturbations in period length. For the
most accurate jitter measurements, calculations are typically only performed
on long sustained vowels. This algorithm calculates 5 different types of
jitter for all voiced intervals, each type of jitter describing different
characteristics of period perturbations. The 5 types of jitter are absolute
jitter, relative jitter, relative average perturbation ( rap ), the 5-point
period perturbation quotient ( ppq5 ), and the difference of differences of
periods ( ddp ).\n
Absolute jitter is defined as the cycle-to-cycle variation of
fundamental frequency, or in other words, the average absolute difference
between consecutive periods.
.. math::
\\frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-T_{i-1}|
Relative jitter is defined as the average absolute difference between
consecutive periods ( absolute jitter ), divided by the average period.
.. math::
\\frac{\\frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-T_{i-1}|}{\\frac{1}{N}\sum_{i=1}^N T_i}
Relative average perturbation is defined as the average absolute difference
between a period and the average of it and its two neighbors divided by the
average period.
.. math::
\\frac{\\frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-(\\frac{1}{3}\sum_{n=i-1}^{i+1}T_n)|}{\\frac{1}{N}\sum_{i=1}^N T_i}
The 5-point period perturbation quotient is defined as the average absolute
difference between a period and the average of it and its 4 closest neighbors
divided by the average period.
.. math::
\\frac{\\frac{1}{N-1}\sum_{i=2}^{N-2}|T_i-(\\frac{1}{5}\sum_{n=i-2}^{i+2}T_n)|}{\\frac{1}{N}\sum_{i=1}^N T_i}
The difference of differences of periods is defined as the relative mean
absolute second-order difference of periods, which is equivalent to 3 times
rap.
.. math::
\\frac{\\frac{1}{N-2}\sum_{i=2}^{N-1}|(T_{i+1}-T_i)-(T_i-T_{i-1})|}{\\frac{1}{N}\sum_{i=1}^{N}T_i}
After each type of jitter has been calculated the values are
returned in a dictionary.
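The sketch below evaluates the five definitions directly on a toy array of
periods; unlike this function, it skips the period_floor, period_ceiling,
and max_period_factor screening that is applied to real pulse data:
>>> import numpy as np
>>> T = 1.0 / 140.0 + .0001 * np.random.randn( 100 ) #perturbed periods
>>> mean_T = T.mean()
>>> absolute = np.mean( abs( np.diff( T ) ) ) #'local, absolute'
>>> local = absolute / mean_T #'local'
>>> rap = np.mean( abs( T[ 1 : -1 ] - ( T[ : -2 ] + T[ 1 : -1 ]
...     + T[ 2 : ] ) / 3.0 ) ) / mean_T
>>> ppq5 = np.mean( abs( T[ 2 : -2 ] - sum( T[ i : len( T ) - 4 + i ]
...     for i in range( 5 ) ) / 5.0 ) ) / mean_T
>>> ddp = 3.0 * rap #'ddp'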
.. warning::
This algorithm has 4.2% relative error when compared to Praat's values.
This algorithm is adapted from:
http://www.lsi.upc.edu/~nlp/papers/far_jit_07.pdf
and from:
http://ac.els-cdn.com/S2212017313002788/1-s2.0-S2212017313002788-main.pdf?_tid=0c860a76-7eda-11e7-a827-00000aab0f02&acdnat=1502486243_009951b8dc70e35597f4cd19f8e05930
and from:
https://github.com/praat/praat/blob/master/fon/VoiceAnalysis.cpp
Args:
signal ( numpy.ndarray ): The signal the jitter will be calculated from.
rate ( int ): the rate per seconds that the signal was sampled at.
period_floor ( float ): ( optional, default value: .0001 ) the shortest possible interval that will be used in the computation of jitter, in seconds. If an interval is shorter than this, it will be ignored in the computation of jitter ( the previous and next intervals will not be regarded as consecutive ).
period_ceiling ( float ): ( optional, default value: .02 ) the longest possible interval that will be used in the computation of jitter, in seconds. If an interval is longer than this, it will be ignored in the computation of jitter ( the previous and next intervals will not be regarded as consecutive ).
max_period_factor ( float ): ( optional, default value: 1.3 ) the largest possible difference between consecutive intervals that will be used in the computation of jitter. If the ratio of the durations of two consecutive intervals is greater than this, this pair of intervals will be ignored in the computation of jitter ( each of the intervals could still take part in the computation of jitter in a comparison with its neighbor on the other side ).
Returns:
dict: a dictionary with keys: 'local', 'local, absolute', 'rap',
'ppq5', and 'ddp'. The values correspond to each type of jitter.\n
local jitter is expressed as a ratio of mean absolute period variation
to the mean period. \n
local absolute jitter is given in seconds.\n
rap is expressed as a ratio of the mean absolute difference between a
period and the mean of its 2 neighbors to the mean period.\n
ppq5 is expressed as a ratio of the mean absolute difference between a
period and the mean of its 4 neighbors to the mean period.\n
ddp is expressed as a ratio of the mean absolute second-order
difference to the mean period.
Example:
In the example below a synthesized signal is used to demonstrate random
perturbations in periods, and how get_Jitter responds.
>>> import numpy as np
>>> domain = np.linspace( 0, 6, 300000 )
>>> y = lambda x:( 1 - .3 * np.sin( 2 * np.pi * 140 * x ) ) * np.sin(
2 * np.pi * 140 * x )
>>> signal = y( domain ) + .2 * np.random.random( 300000 )
>>> rate = 50000
>>> get_Jitter( signal, rate )
{ 'ddp': 0.047411037373434134,
'local': 0.02581897560637415,
'local, absolute': 0.00018442618908563846,
'ppq5': 0.014805010237029443,
'rap': 0.015803679124478043 }
>>> get_Jitter( signal, rate, period_floor = .001,
period_ceiling = .01, max_period_factor = 1.05 )
{ 'ddp': 0.03264516540374475,
'local': 0.019927260366800197,
'local, absolute': 0.00014233584195389132,
'ppq5': 0.011472274162612033,
'rap': 0.01088172180124825 }
>>> y = lambda x:( 1 - .3 * np.sin( 2 * np.pi * 140 * x ) ) * np.sin(
2 * np.pi * 140 * x )
>>> signal = y( domain )
>>> get_Jitter( signal, rate )
{ 'ddp': 0.0015827628114371581,
'local': 0.00079043477724730755,
'local, absolute': 5.6459437833161522e-06,
'ppq5': 0.00063462518488944565,
'rap': 0.00052758760381238598 }
"""
pulses = get_Pulses( signal, rate )
periods = np.diff( pulses )
min_period_factor = 1.0 / max_period_factor
#finding local, absolute
#described at:
#http://www.fon.hum.uva.nl/praat/manual/PointProcess__Get_jitter__local__absolute____.html
sum_total = 0
num_periods = len( pulses ) - 1
for i in range( len( periods ) - 1 ):
p1, p2 = periods[ i ], periods[ i + 1 ]
ratio = p2 / p1
if (ratio < max_period_factor and ratio > min_period_factor and
p1 < period_ceiling and p1 > period_floor and
p2 < period_ceiling and p2 > period_floor ):
sum_total += abs( periods[ i + 1 ] - periods[ i ] )
else: num_periods -= 1
absolute = sum_total / ( num_periods - 1 )
#finding local,
#described at:
#http://www.fon.hum.uva.nl/praat/manual/PointProcess__Get_jitter__local____.html
sum_total = 0
num_periods = 0
#duplicating edges so there is no need to test edge cases
periods = np.hstack( ( periods[ 0 ], periods, periods[ -1 ] ) )
for i in range( len( periods ) - 2 ):
p1, p2, p3 = periods[ i ], periods[ i + 1 ], periods[ i + 2 ]
ratio_1, ratio_2 = p1 / p2, p2 / p3
if (ratio_1 < max_period_factor and ratio_1 > min_period_factor and
ratio_2 < max_period_factor and ratio_2 > min_period_factor and
p2 < period_ceiling and p2 > period_floor ):
sum_total += p2
num_periods += 1
#removing duplicated edges
periods = periods[ 1 : -1 ]
avg_period = sum_total / ( num_periods )
relative = absolute / avg_period
#finding rap
#described at:
#http://www.fon.hum.uva.nl/praat/manual/PointProcess__Get_jitter__rap____.html
sum_total = 0
num_periods = 0
for i in range( len( periods ) - 2 ):
p1, p2, p3 = periods[ i ], periods[ i + 1 ], periods[ i + 2 ]
ratio_1, ratio_2 = p1 / p2, p2 / p3
if (ratio_1 < max_period_factor and ratio_1 > min_period_factor and
ratio_2 < max_period_factor and ratio_2 > min_period_factor and
p1 < period_ceiling and p1 > period_floor and
p2 < period_ceiling and p2 > period_floor and
p3 < period_ceiling and p3 > period_floor ):
sum_total += abs( p2 - ( p1 + p2 + p3 ) / 3.0 )
num_periods += 1
rap = ( sum_total / num_periods ) / avg_period
#finding ppq5
#described at:
#http://www.fon.hum.uva.nl/praat/manual/PointProcess__Get_jitter__ppq5____.html
sum_total = 0
num_periods = 0
for i in range( len( periods ) - 4 ):
p1, p2, p3 = periods[ i ], periods[ i + 1 ], periods[ i + 2 ]
p4, p5 = periods[ i + 3 ], periods[ i + 4 ]
ratio_1, ratio_2, ratio_3, ratio_4 = p1 / p2, p2 / p3, p3 / p4, p4 / p5
if (ratio_1 < max_period_factor and ratio_1 > min_period_factor and
ratio_2 < max_period_factor and ratio_2 > min_period_factor and
ratio_3 < max_period_factor and ratio_3 > min_period_factor and
ratio_4 < max_period_factor and ratio_4 > min_period_factor and
p1 < period_ceiling and p1 > period_floor and
p2 < period_ceiling and p2 > period_floor and
p3 < period_ceiling and p3 > period_floor and
p4 < period_ceiling and p4 > period_floor and
p5 < period_ceiling and p5 > period_floor ):
sum_total += abs( p3 - ( p1 + p2 + p3 + p4 + p5 ) / 5.0 )
num_periods += 1
ppq5 = ( sum_total / num_periods ) / avg_period
#Praat calculates ddp by multiplying rap by 3
#described at:
#http://www.fon.hum.uva.nl/praat/manual/PointProcess__Get_jitter__ddp____.html
return { 'local' : relative, 'local, absolute' : absolute, 'rap' : rap,
'ppq5' : ppq5, 'ddp' : 3 * rap }