Exhibition Program

Science of Media Information


Changing your voice and speaking style

- Voice and prosody conversion with a sequence-to-sequence model -


We propose a voice and prosody conversion method that can be used both to impersonate a desired speaker’s identity and to hide a speaker’s own identity. The method consists of two stages: acoustic feature conversion and a time-domain neural postfilter. The acoustic feature conversion is based on sequence-to-sequence learning with an attention mechanism, which makes it possible to capture long-range temporal dependencies between the source and target sequences. The postfilter employs a cyclic model based on adversarial networks, which requires no assumptions about speech waveform modeling. In contrast to current voice conversion techniques, the proposed method converts not only voice timbre but also prosody and rhythm, while the time-domain neural postfilter enables high-quality speech waveform generation. The remaining challenge is real-time voice conversion, which is our ongoing work.
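
To illustrate how an attention mechanism can relate every target frame to every source frame (and so handle sequences of different lengths, which is what allows rhythm to change), the following is a minimal, generic sketch of scaled dot-product attention for one decoder step. It is not the implementation used in the proposed method; the function names `softmax` and `attend`, the scaled dot-product scoring, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single target (decoder) step.

    query:  feature vector for the current target frame (length d)
    keys:   list of source-frame vectors, each of length d
    values: list of source-frame vectors to be mixed (same count as keys)
    Returns (context vector, attention weights over the source frames).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim)
    ]
    return context, weights


# Hypothetical toy data: 4 source frames and 2 target steps.
# The source and target lengths do not need to match.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
queries = [[5.0, 0.0], [0.0, 5.0]]

for t, q in enumerate(queries):
    context, weights = attend(q, source, source)
    print(t, [round(w, 3) for w in weights])
```

Because each target step computes its own weights over the whole source sequence, the output can be longer or shorter than the input, unlike frame-by-frame conversion, which keeps the source timing.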


  • [1] K. Tanaka, H. Kameoka, T. Kaneko, N. Hojo, "AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms," in Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2019), May 2019.
  • [2] K. Tanaka, H. Kameoka, T. Kaneko, N. Hojo, "WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation," arXiv:1904.02892, Apr. 2019 (submitted to INTERSPEECH2019).




Kou Tanaka, Learning and Intelligent Systems Research Group, Innovative Communication Laboratory