Media Intelligence

Analyzing, synthesizing and converting speech prosody

- Generative modeling of voice fundamental frequency contours -

Abstract

Linear Predictive Coding (LPC), proposed in the 1960s, established the modern speech analysis/synthesis framework and opened the door to today’s mobile and VoIP communication technology. Whereas LPC realizes an analysis/synthesis framework focused on the 'phonemic' factor of speech, this work aims to develop a new analysis/synthesis framework focused on the 'prosodic' factor. A well-founded physical model of vocal fold vibration was proposed by Fujisaki in the 1960s (known as the "Fujisaki model"), but estimating its underlying parameters has long been a difficult task. We have developed a stochastic counterpart of the Fujisaki model, which makes it possible to apply powerful statistical inference techniques to estimate those parameters accurately. This model has strong potential to be developed into a next-generation module for text-to-speech, speech analysis, synthesis, and conversion systems.
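
For readers unfamiliar with the Fujisaki model, the following is a minimal sketch of its classic deterministic form, in which the log fundamental frequency (F0) is a baseline plus the summed responses to phrase and accent commands. It does not show the stochastic counterpart or the inference method described above. The function names, the command timings and magnitudes, and the constants (alpha = 3.0/s, beta = 20.0/s, gamma = 0.9, values commonly quoted for this model) are illustrative assumptions and are not taken from this abstract.

```python
import math

# Commonly quoted constants for the Fujisaki model (assumed here, not from the abstract).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before the command)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, base_hz, phrases, accents):
    """Log F0 at time t.

    phrases: list of (onset_time, magnitude) impulse commands.
    accents: list of (onset_time, offset_time, amplitude) stepwise commands.
    """
    value = math.log(base_hz)
    for onset, magnitude in phrases:
        value += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accents:
        value += amplitude * (accent_response(t - onset) - accent_response(t - offset))
    return value


def f0_contour(times, base_hz, phrases, accents):
    """F0 in Hz at each time in `times`."""
    return [math.exp(log_f0(t, base_hz, phrases, accents)) for t in times]


# Hypothetical commands: one phrase command and two accent commands over 2 seconds.
times = [i * 0.01 for i in range(200)]
contour = f0_contour(
    times,
    base_hz=100.0,
    phrases=[(0.0, 0.5)],
    accents=[(0.3, 0.6, 0.4), (0.9, 1.3, 0.3)],
)
```

In this form, the estimation problem mentioned in the abstract is the inverse of `f0_contour`: recovering the command timings and magnitudes from an observed F0 contour.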

Presenters

Hirokazu Kameoka
Media Information Laboratory

Takuhiro Kaneko
Media Information Laboratory