Automatic speech recognition (ASR) technology endeavors to endow computers with the primary human communication channel: speech. Given an audio signal, an ASR system first identifies the segments that contain speech. From each speech segment, it extracts a sequence of information-rich feature vectors and matches them to the most likely word sequence, given a set of previously trained speech models that reflect the salient features of a language's phonemes. Although they may correspond to the same spoken word content, real-world speech signals vary greatly depending on the speaker and on the acoustic environment, especially for spontaneous speech in natural human conversations, where speech patterns are highly diverse, ambiguous, and incomplete. Human beings absorb this variation remarkably well and instantly supply the missing components. Our research on speech recognition technology aims to bring to computers the same high-performance, high-speed, and robust speech recognition that humans achieve.
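The front end described above can be sketched in a few lines. The following is a minimal, illustrative feature extractor that slices a waveform into overlapping frames and computes log-power spectral vectors; real systems typically use mel filterbanks or MFCCs with delta features, and the frame and hop lengths here are common but assumed values.

```python
import numpy as np

def extract_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice the waveform into overlapping windowed frames and compute
    one log-power spectral feature vector per frame (a simplified
    front end; production systems use mel filterbanks or MFCCs)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats.append(np.log(power + 1e-10))  # floor avoids log(0)
    return np.array(feats)

# one second of synthetic audio: a 440 Hz tone
t = np.arange(16000) / 16000.0
feats = extract_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 201): 98 frames, 201 FFT bins
```

The resulting matrix, one feature vector per 10 ms frame, is what the decoder matches against the trained speech models.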
Speech recognition technology has many applications, including the control of consumer electronic devices, the automatic generation of meeting minutes, and the captioning of audio/visual content. With effective ASR, conversational robots that can understand human speech could become a reality. Leveraging the high-speed data processing ability of computers, large amounts of speech data can be analyzed, organized, summarized, and translated. In the future, speech recognition technology will be used to give ears to robots, and speech analyzers will be part of daily life. To make ASR effective for various applications, we are improving the algorithms for speech analysis, model training, search, and backend processing.
In real-world environments, interference from various kinds of ambient noise makes it difficult for ASR systems to transcribe audio signals accurately. Speech enhancement techniques improve the audible quality of speech signals; however, the resulting gain in recognition accuracy is usually limited, because the enhanced speech inevitably contains distortion and residual noise. We have proposed a method that mitigates these effects on ASR by dynamically compensating the acoustic model parameters based on a reliability estimate of the enhanced speech signals. This method enables us to integrate various speech enhancement techniques with ASR.
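One common way to realize this kind of reliability-based compensation is uncertainty decoding: the estimated uncertainty of the enhanced features is added to the model variances, so frames the enhancer is unsure about influence the acoustic score less. The sketch below illustrates that idea for a single diagonal Gaussian; it is a generic textbook formulation, not the exact proposed method, and all variable names are illustrative.

```python
import numpy as np

def compensated_loglik(x_enh, var_enh, mean, var):
    """Log-likelihood of enhanced feature vector x_enh under a diagonal
    Gaussian (mean, var), with the model variance inflated by var_enh,
    the estimated uncertainty of the enhancement.  Unreliable
    dimensions are thereby down-weighted in the acoustic score."""
    v = var + var_enh  # reliability-based variance compensation
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (x_enh - mean) ** 2 / v)

mean, var = np.zeros(3), np.ones(3)
x = np.array([1.0, -0.5, 0.2])
base = compensated_loglik(x, np.zeros(3), mean, var)  # fully reliable frame
noisy = compensated_loglik(x, np.ones(3), mean, var)  # uncertain frame
print(base, noisy)
```

With zero uncertainty the score reduces to the ordinary Gaussian log-likelihood, so clean speech is handled exactly as before; the compensation only activates where the enhancer reports residual distortion.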
We have established several methods for constructing highly accurate acoustic and language models, such as dMMI (*1)-based discriminative acoustic model training and R2D2 (*2)-based discriminative language model training. These methods provide models with better discrimination and generalization performance than previous approaches. We have also developed a WFST (*3)-based linear classifier that can be trained directly on a WFST composed of acoustic and language models. This reduces recognition errors that cannot be recovered using only separately trained acoustic and language models.
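To make the idea of discriminative training concrete, the toy sketch below trains a linear reranker over n-best hypotheses with a structured-perceptron update, pushing the weights so that the reference hypothesis scores highest. This is only an illustration of the discriminative principle; dMMI and R2D2 use different, lattice-based objectives, and the feature set here is hypothetical.

```python
import numpy as np

def train_reranker(nbest_lists, refs, dim, epochs=10):
    """Structured-perceptron training: whenever the current weights
    rank a wrong hypothesis first, move the weights toward the
    reference hypothesis's features and away from the winner's."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for feats, ref in zip(nbest_lists, refs):
            best = max(range(len(feats)), key=lambda i: feats[i] @ w)
            if best != ref:
                w += feats[ref] - feats[best]
    return w

# two utterances, each with 2 hypotheses described by 3 hypothetical
# features (e.g. acoustic score, LM score, word count)
nbest = [np.array([[0.2, 0.1, 1.0], [0.9, 0.3, 0.0]]),
         np.array([[0.5, 0.8, 0.0], [0.1, 0.2, 1.0]])]
refs = [0, 1]  # index of the correct hypothesis in each list
w = train_reranker(nbest, refs, dim=3)
```

The same error-driven principle, applied directly on a composed WFST rather than on n-best lists, is what allows a classifier to correct errors that neither the acoustic nor the language model can fix alone.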
Since ASR is used in a variety of situations, it needs to handle multi-speaker conversations on a wide variety of topics while remaining sufficiently fast and memory efficient. Speech applications must both convert speech signals into text and extract rich information such as the speaker, topic, and circumstance, along with the confidence of the ASR result. We have developed a very fast algorithm that achieves real-time ASR with a 10-million-word vocabulary using WFSTs. We have also developed topic tracking language models and a confidence estimation method that identifies the causes of errors in ASR results.
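A common starting point for confidence estimation is the word posterior: hypothesis scores from an n-best list are normalized with a softmax, and each word is credited with the posterior mass of every hypothesis containing it. The sketch below shows this baseline idea only; it is not the proposed method, which additionally attributes error causes, and the example scores are invented.

```python
import math

def word_confidences(hyps):
    """hyps: list of (word_tuple, log_score) pairs from an n-best list.
    Returns {word: posterior-based confidence in [0, 1]}."""
    m = max(s for _, s in hyps)                  # for numerical stability
    probs = [math.exp(s - m) for _, s in hyps]   # softmax numerators
    z = sum(probs)
    conf = {}
    for (words, _), p in zip(hyps, probs):
        for w in set(words):
            conf[w] = conf.get(w, 0.0) + p / z   # accumulate posterior mass
    return conf

hyps = [(("the", "cat", "sat"), -1.0),
        (("the", "cap", "sat"), -2.0),
        (("a", "cat", "sat"), -3.0)]
conf = word_confidences(hyps)
print(conf)  # "sat" appears in every hypothesis, so its confidence is 1.0
```

Words that survive across competing hypotheses accumulate high confidence, while words the decoder was unsure about (here "cat" versus "cap") receive proportionally lower scores.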