Transformation functions

phoneshift.transform(wav: ndarray, fs: float, ts_pbf: float = 1.0, esf: float = 1.0, esp: bool = True, psf: float = 1.0, psf_max: float = 2.0, clipper_knee: float = 0.66, winlen_inner: float = 0.020 * fs, timestep: float = 0.005 * fs, f0_min: float = 27.5, f0_max: float = 3520) → ndarray[float32]

This is the generic function for transforming a voice signal while applying multiple audio effects. See below for further functions dedicated to specific tasks.

Note

It assumes the signal is monophonic, like a voice, a flute, a violin, a saxophone, etc.

Using it on polyphonic signals, like a piano, a guitar, a drum set, etc., is not recommended.

Parameters:
  • wav – Input signal. Currently, spatialisation in a multichannel signal is not preserved: multichannel signals are averaged across the channel dimension, processed, and then duplicated to the original number of channels.

  • fs – Sampling rate [Hz].

  • ts_pbf – Playback factor used for time scaling [coefficient, def. 1.0].

  • esf – Envelope scaling factor [coefficient, def. 1.0].

  • esp – Preserve the spectral envelope [boolean, def. True]. Also known as “formant preservation”.

  • psf – Pitch scaling factor [coefficient, def. 1.0].

  • psf_max – Maximum value for pitch scaling factor [coefficient, def. 2.0].

  • clipper_knee – Clipper knee amplitude [linear amplitude, def. 0.66, source]. This prevents the signal from clipping at 1.0 when saving it to a file, which would create audio glitches. The knee amplitude is the point where the clipper starts to act; above it, the signal is compressed so that it never exceeds ±1.0 in amplitude. The lower the value, the fewer glitches, but the more the signal is distorted. Set it to None to disable the clipper.
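
For intuition, here is a minimal sketch of what a knee clipper of this kind can look like: linear below the knee, smooth saturation towards ±1.0 above it. This is an illustration under those assumptions, not necessarily the exact curve used by phoneshift.

import numpy as np

def soft_clip(x, knee=0.66):
    # Below the knee, the signal is left untouched. Above it, the remaining
    # headroom (1.0 - knee) is approached asymptotically via tanh, so the
    # output never exceeds ±1.0.
    y = np.asarray(x, dtype=np.float32).copy()
    over = np.abs(y) > knee
    headroom = 1.0 - knee
    y[over] = np.sign(y[over]) * (knee + headroom * np.tanh((np.abs(y[over]) - knee) / headroom))
    return y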

Note

The following arguments are used to tune the processing’s audio quality and speed. Changing them is not recommended unless you know what you are doing.

You can use transform_timescaling and transform_pitchscaling, which will set them automatically for you depending on the task.

Parameters:
  • winlen_inner – Inner window length [#samples, def. 0.020*fs]. This is the window length used for the inner processing. The bigger the value, the more stable the sound, but the slower the processing.

  • timestep – Time step [#samples, def. 0.005*fs]. This is the time step from one frame to the next. The smaller the value, the more stable the sound, but the slower the processing.

  • f0_min – Minimum value for the fundamental frequency [Hz, def. 440/16=27.5]. This prevents the pitch from going too low and creating audio glitches.

  • f0_max – Maximum value for the fundamental frequency [Hz, def. 440*8=3520]. This prevents the pitch from going too high and creating audio glitches.
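
For example, to trade some speed for stability you could enlarge both the inner window and the time step. The values below are purely illustrative:

import phoneshift
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')
# 30 ms inner window and 10 ms step, both expressed in samples.
syn = phoneshift.transform(wav, fs,
                           winlen_inner=int(0.030 * fs),
                           timestep=int(0.010 * fs))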

Returns:

  • ndarray[float32] – The modified signal.

    Shape will be the same as the input signal’s. The type will always be float32, since the whole processing runs at float32 precision.
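
These guarantees can be verified directly:

import numpy as np
import phoneshift
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')
syn = phoneshift.transform(wav, fs)
assert syn.shape == wav.shape   # same shape as the input
assert syn.dtype == np.float32  # the whole processing runs at float32 precision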

Examples:

import phoneshift
import soundfile

# Load the input, scale its pitch by a factor of 2.0 and save the result.
wav, fs = soundfile.read('path/to/audio.wav')
syn = phoneshift.transform(wav, fs, psf=2.0)
soundfile.write('syn.wav', syn, fs)
Processing flow:

This function is based on an Overlap-Add process whose base implementation is freely available here.

[Figure: processing flow (vocoder_time.svg)]
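
To illustrate the idea (this is a generic sketch, not phoneshift’s actual implementation), a windowed overlap-add skeleton extracts frames with an analysis window, optionally modifies them, then sums them back with a synthesis window and normalises:

import numpy as np

def ola_passthrough(wav, fs, winlen=0.020, timestep=0.005):
    nwin = int(winlen * fs)
    hop = int(timestep * fs)
    win = np.hanning(nwin)
    out = np.zeros(len(wav), dtype=np.float32)
    norm = np.zeros(len(wav), dtype=np.float32)
    for start in range(0, len(wav) - nwin, hop):
        frame = wav[start:start + nwin] * win   # analysis window
        # A vocoder would modify the frame here (pitch, envelope, etc.).
        out[start:start + nwin] += frame * win  # synthesis window
        norm[start:start + nwin] += win ** 2
    norm[norm < 1e-8] = 1.0  # avoid division by zero at the edges
    return out / norm
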
phoneshift.transform_timescaling(wav: ndarray, fs: float, **kwargs)

Same arguments and return values as transform().

This function alters a few of transform()’s arguments in order to optimize speed for time scaling, without compromising audio quality.
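
A usage sketch (the factor value is illustrative):

import phoneshift
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')
# Time-scale with a playback factor of 0.5; ts_pbf follows transform()'s convention.
syn = phoneshift.transform_timescaling(wav, fs, ts_pbf=0.5)
soundfile.write('timescaled.wav', syn, fs)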

phoneshift.transform_pitchscaling(wav: ndarray, fs: float, **kwargs)

Same arguments and return values as transform().

This function alters a few of transform()’s arguments in order to optimize speed for pitch scaling only, without compromising audio quality.
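
A usage sketch, mirroring the transform() example above:

import phoneshift
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')
# Scale the pitch by a factor of 2.0, i.e. presumably one octave up.
syn = phoneshift.transform_pitchscaling(wav, fs, psf=2.0)
soundfile.write('octave_up.wav', syn, fs)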