Exhibition Program

Science of Media Information


Face-to-voice conversion and voice-to-face conversion

- Crossmodal voice conversion with deep generative models -


Humans can imagine a person's voice from that person's appearance, and a person's appearance from his or her voice. In this work, we exploit the correlation between faces and voices and take an information-theoretic approach using deep generative models to develop a method that can (1) convert speech into a voice that matches an input face image and (2) generate a face image that matches the voice of input speech. The proposed model consists of a speech encoder/decoder, a face encoder/decoder, and a voice encoder. The face encoder encodes an input face image into a latent code, which is fed to the speech decoder as an auxiliary input. We train the speech encoder/decoder so that the voice encoder can recover the original latent code from the generated speech. We also train the face decoder together with the face encoder to ensure that the latent code contains sufficient information to reconstruct the input face image.
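The data flow among the five components can be sketched in a few lines of Python. This is only a toy illustration, not the authors' implementation: every network is replaced by a fixed random linear map, faces and speech are single short vectors rather than images and feature sequences, and all names and dimensions (`FACE_DIM`, `forward`, `voice_to_face`, and so on) are hypothetical. The voice-to-face path shown here, voice encoder followed by face decoder, is likewise an assumption about how the components are combined.

```python
import random

random.seed(0)

# Hypothetical toy dimensions; the real model works on images and speech features.
FACE_DIM, SPEECH_DIM, CONTENT_DIM, LATENT_DIM = 16, 12, 6, 4


def linear(n_in, n_out):
    """Random weight matrix (n_out rows, n_in columns) standing in for a trained network."""
    return [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# The five components named in the text, each reduced to a linear map (assumption).
face_encoder = linear(FACE_DIM, LATENT_DIM)
face_decoder = linear(LATENT_DIM, FACE_DIM)
speech_encoder = linear(SPEECH_DIM, CONTENT_DIM)
speech_decoder = linear(CONTENT_DIM + LATENT_DIM, SPEECH_DIM)
voice_encoder = linear(SPEECH_DIM, LATENT_DIM)


def forward(face, speech):
    """Face-to-voice conversion plus the two training signals described in the text."""
    z_face = apply(face_encoder, face)            # latent code of the face image
    content = apply(speech_encoder, speech)       # encoding of the input speech
    # The face latent code is the auxiliary input to the speech decoder
    # (here simply concatenated with the speech encoding).
    converted = apply(speech_decoder, content + z_face)
    z_voice = apply(voice_encoder, converted)     # try to recover the face latent code
    face_recon = apply(face_decoder, z_face)      # reconstruct the face from its code
    return {
        "converted_speech": converted,
        "latent_recovery_loss": mse(z_voice, z_face),
        "face_reconstruction_loss": mse(face_recon, face),
    }


def voice_to_face(speech):
    """Assumed voice-to-face path: voice encoder followed by face decoder."""
    return apply(face_decoder, apply(voice_encoder, speech))


face = [random.random() for _ in range(FACE_DIM)]
speech = [random.random() for _ in range(SPEECH_DIM)]
out = forward(face, speech)
generated_face = voice_to_face(speech)
```

In an actual training setup the two loss values would be minimized jointly with respect to the network parameters; here they are only computed once to show which outputs are compared with which.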


  • [1] H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), pp. 266-273, Dec. 2018.
  • [2] H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," arXiv:1808.05092 [stat.ML], 2018.
  • [3] H. Kameoka, K. Tanaka, A. Valero Puche, Y. Ohishi, T. Kaneko, "Crossmodal voice conversion," arXiv:1904.04540 [cs.SD], 2019.




Hirokazu Kameoka, Media Recognition Research Group, Media Information Laboratory