Learning speech recognition from small paired data
- Semi-supervised end-to-end training with text-to-speech -
Abstract
We propose a semi-supervised end-to-end method for learning speech recognition from a small amount of paired data and a
large amount of unpaired data. Our motivation is that preparing paired data, i.e., recorded speech and its transcription, requires a
large amount of human effort. In our method, we introduce speech and text autoencoders that share encoders
and decoders with an automatic speech recognition (ASR) model to improve ASR performance using speech-only
and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and
text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech-only and
text-only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously,
a multi-task loss encourages the encoded features to remain compatible with both the ASR and TTS models.
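To make the module sharing concrete, below is a minimal PyTorch sketch of the four shared modules and the multi-task objective. It is an illustration under stated assumptions, not the authors' implementation: the simple GRU and linear layers stand in for the attention-based ASR/TTS encoder-decoders, no encoder subsampling is modeled (so frame-level losses line up), and the module names and the loss weights alpha and beta are hypothetical.

```python
import torch
import torch.nn as nn

class SharedModules(nn.Module):
    """Speech/text encoders and decoders; pairing them yields four models:
    ASR (speech enc -> text dec), TTS (text enc -> speech dec),
    speech autoencoder (speech enc -> speech dec),
    and text autoencoder (text enc -> text dec)."""

    def __init__(self, feat_dim=80, vocab_size=100, hidden=256):
        super().__init__()
        self.speech_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        # Linear stand-ins for the attention-based ASR/TTS decoders.
        self.text_dec = nn.Linear(hidden, vocab_size)
        self.speech_dec = nn.Linear(hidden, feat_dim)

    def encode_speech(self, x):              # x: (batch, frames, feat_dim)
        h, _ = self.speech_enc(x)
        return h                              # (batch, frames, hidden)

    def encode_text(self, y):                 # y: (batch, tokens), int64
        h, _ = self.text_enc(self.embed(y))
        return h                              # (batch, tokens, hidden)


def multitask_loss(model, speech, text, speech_only, text_only,
                   alpha=1.0, beta=1.0):
    """Supervised ASR loss on paired data plus autoencoder losses on
    unpaired data. Frame-level alignment is assumed only in this toy sketch."""
    ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()
    # ASR path on the small paired set: speech encoder -> text decoder.
    asr = ce(model.text_dec(model.encode_speech(speech)).transpose(1, 2), text)
    # Speech autoencoder on speech-only data: speech encoder -> TTS decoder.
    speech_ae = l1(model.speech_dec(model.encode_speech(speech_only)), speech_only)
    # Text autoencoder on text-only data: text encoder -> ASR decoder.
    text_ae = ce(model.text_dec(model.encode_text(text_only)).transpose(1, 2), text_only)
    return asr + alpha * speech_ae + beta * text_ae


model = SharedModules()
loss = multitask_loss(
    model,
    speech=torch.randn(2, 50, 80),
    text=torch.randint(0, 100, (2, 50)),
    speech_only=torch.randn(2, 50, 80),
    text_only=torch.randint(0, 100, (2, 50)),
)
loss.backward()  # gradients reach the shared encoders and decoders from all three losses
```

Because every loss term backpropagates through the same encoder and decoder parameters, the speech-only and text-only data shape the representations used by the ASR path, which is what allows the unpaired data to improve recognition accuracy.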