We present speech processing technologies that we have developed for conversational speech recognition. We focus on techniques for speaker activity estimation (estimating the periods during which each speaker is talking), because they play an important role in conversational speech recognition. We show that a target speech signal can be enhanced from a recorded conversational speech signal by controlling the speech enhancement process according to the estimated speaker activities. Speech recognition accuracy can also be improved by introducing speaker activity information, including turn-taking information, into the language model of a speech recognition system. We also present our newly developed speaker activity estimation method, which is based on a probabilistic model of speaker spatial information. These technologies contribute to realizing a more natural voice interface for everyday speech communication.