- Voice and prosody conversion with sequence-to-sequence model -
Abstract
We propose a voice and prosody conversion method for impersonating a desired speaker's identity or hiding a speaker's identity. The method consists of two components: acoustic feature conversion and a time-domain neural post-filter. The acoustic feature conversion is based on sequence-to-sequence learning with an attention mechanism [1], which makes it possible to capture long-range temporal dependencies between the source and target sequences. The post-filter employs a cyclic model based on generative adversarial networks [2], which requires no explicit assumptions about the speech waveform. In contrast to conventional voice conversion techniques, the proposed method can convert not only voice timbre but also prosody and rhythm, while the time-domain neural post-filter enables high-quality speech waveform generation. The remaining challenge is real-time voice conversion, which is our ongoing work.
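To illustrate the first component, the following is a minimal sketch, assuming a PyTorch implementation, of sequence-to-sequence acoustic feature conversion with dot-product attention. The module, dimension, and parameter names (Seq2SeqConverter, feat_dim, hidden) are our own illustrative assumptions; the actual model of [1] additionally uses context preservation mechanisms not shown here.

```python
# Minimal sketch of seq2seq acoustic feature conversion with attention.
# Not the authors' AttS2S-VC implementation; all names are illustrative.
import torch
import torch.nn as nn


class Seq2SeqConverter(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(feat_dim + hidden, hidden)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, src, max_len):
        # src: (batch, src_len, feat_dim) source acoustic features
        enc, _ = self.encoder(src)                  # (B, S, H)
        B, _, H = enc.shape
        h = enc.new_zeros(B, H)                     # decoder state
        frame = src.new_zeros(B, src.size(-1))      # previous output frame
        outputs = []
        for _ in range(max_len):                    # output length may differ
            # Dot-product attention over encoder states: the learned
            # source/target alignment is what allows rhythm conversion,
            # instead of assuming frame-by-frame correspondence.
            score = torch.bmm(enc, h.unsqueeze(-1)).squeeze(-1)  # (B, S)
            attn = score.softmax(dim=-1)
            context = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)
            h = self.decoder(torch.cat([frame, context], dim=-1), h)
            frame = self.proj(h)
            outputs.append(frame)
        return torch.stack(outputs, dim=1)          # (B, max_len, feat_dim)
```

During training, the decoder would typically be run with teacher forcing on paired source/target utterances; here the loop simply feeds back its own predictions, as at inference time.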
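Similarly, the cyclic adversarial objective behind the second component, the time-domain post-filter of [2], can be sketched as follows. The 1-D convolutional backbones, the least-squares adversarial term, and the cycle weight lam are assumptions chosen for illustration, not the paper's configuration; the discriminator update on real natural speech is also omitted.

```python
# Minimal sketch (assumed PyTorch, not the authors' code) of a
# cycle-consistent adversarial objective for a time-domain post-filter.
import torch
import torch.nn as nn


def make_net(ch=64):
    # Toy 1-D convolutional backbone operating directly on waveform
    # samples, so no vocoder-style signal model is imposed.
    return nn.Sequential(
        nn.Conv1d(1, ch, kernel_size=15, padding=7), nn.LeakyReLU(0.2),
        nn.Conv1d(ch, 1, kernel_size=15, padding=7),
    )

G_sn, G_ns = make_net(), make_net()  # synthetic->natural and its inverse
D_nat = make_net()                   # discriminator on the natural domain

def generator_loss(x_syn, lam=10.0):
    # x_syn: (batch, 1, samples) waveform produced by a vocoder/synthesizer
    fake_nat = G_sn(x_syn)
    # Least-squares adversarial term (one common GAN formulation).
    adv = ((D_nat(fake_nat) - 1.0) ** 2).mean()
    # Cycle-consistency (L1): mapping back should recover the input, so
    # the post-filter cannot discard linguistic or prosodic content.
    cyc = (G_ns(fake_nat) - x_syn).abs().mean()
    return adv + lam * cyc
```

In such a setup, x_syn would be, for example, the output of a conventional vocoder, and the trained G_sn is applied once at inference time as the post-filter.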
References
[1] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms," in Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), May 2019.
[2] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation," arXiv:1904.02892, Apr. 2019 (submitted to INTERSPEECH 2019).