DIRECT MODELLING OF MAGNITUDE AND PHASE SPECTRA FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS – Interspeech 2017
(Felipe Espic, Cassia Valentini-Botinhao, Simon King / CSTR, University of Edinburgh, UK)
We propose a simple new representation for the FFT spectrum tailored to statistical parametric speech synthesis. It consists of four feature streams that describe magnitude, phase and fundamental frequency using real numbers. The proposed feature extraction method does not attempt to decompose the speech structure (e.g., into source+filter or harmonics+noise). By avoiding the simplifications inherent in decomposition, we can dramatically reduce the “phasiness” and “buzziness” typical of most vocoders. The method uses simple and computationally cheap operations and can operate at a lower frame rate than the 200 frames-per-second typical in many systems. It avoids heuristics and methods requiring approximate or iterative solutions, including phase unwrapping.
Two DNN-based acoustic models were built – from male and female speech data – using the Merlin toolkit. Subjective comparisons were made with a state-of-the-art baseline, using the STRAIGHT vocoder. In all variants tested, and for both male and female voices, the proposed method substantially outperformed the baseline. We provide source code to enable our complete system to be replicated.
The paper, presented at Interspeech 2017, is freely available.
Code and Merlin scripts
The analysis/synthesis system developed in this research is called the MagPhase Vocoder because it encodes both magnitude and phase spectra. The MagPhase Vocoder and the Merlin scripts used in the experiments can be downloaded from the official CSTR GitHub account.
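To illustrate the general idea of describing a complex spectrum with real numbers and without phase unwrapping, here is a minimal sketch. It is not the MagPhase Vocoder code. It assumes, for illustration only, that phase is carried by the real and imaginary parts of each bin divided by its magnitude; the function names (`dft`, `encode`, `decode`), the `floor` parameter and the 16-sample test frame are all hypothetical, and the fundamental frequency stream is left out. The paper and the released source define the actual features.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT; stands in for an FFT to keep the sketch dependency-free."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def encode(spectrum, floor=1e-12):
    """Split each complex bin into magnitude plus normalised real/imag parts.

    Assumption: phase is represented by (real / |X|, imag / |X|), which are
    bounded real numbers. `floor` only guards against division by zero.
    """
    mag = [abs(c) for c in spectrum]
    real = [c.real / max(m, floor) for c, m in zip(spectrum, mag)]
    imag = [c.imag / max(m, floor) for c, m in zip(spectrum, mag)]
    return mag, real, imag


def decode(mag, real, imag):
    """Rebuild complex bins; atan2 recovers the phase with no unwrapping step."""
    return [
        m * cmath.exp(1j * math.atan2(i, r))
        for m, r, i in zip(mag, real, imag)
    ]


# Hypothetical 16-sample frame: one sinusoid (3 cycles) plus a small offset.
frame = [math.sin(2 * math.pi * 3 * t / 16) + 0.1 for t in range(16)]

spectrum = dft(frame)
mag, real, imag = encode(spectrum)
rebuilt = decode(mag, real, imag)

# Largest reconstruction error over all bins of the round trip.
max_error = max(abs(a - b) for a, b in zip(spectrum, rebuilt))
```

Because the normalised real and imaginary parts stay within [-1, 1] and vary smoothly where the phase itself would wrap around, streams of this kind are convenient regression targets for an acoustic model, which is the motivation suggested by the abstract's remark about avoiding phase unwrapping.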
The samples on this page support the Interspeech 2017 submission titled above and cover the following conditions:
Nat: Natural speech (the hidden reference).
Base: The Baseline system running at a constant frame rate and using STRAIGHT for analysis/synthesis.
PM: The Proposed Method with settings as described in the paper.
PMVNAp: The Proposed Method with Voiced segments having No Aperiodic component.
PMVNApW: The Proposed Method with Voiced segments having No Aperiodicity Window.