Speech is one of the most natural and useful media for human communication. If computers could appropriately handle the speech signals around us, they could provide more convenient and comfortable speech services. However, when a speech signal is captured by distant microphones, background noise and reverberation contaminate the original signal and severely degrade the performance of existing speech applications. To overcome these limitations, we are investigating methods for automatically detecting the individual speech signals in a captured mixture (scene analysis) and recovering their original quality (speech enhancement). Our goal is to establish techniques that extract information such as “who spoke when” from human communication scenes and enable various speech applications to work reliably in the real world.
Existing speech applications rely on close-talk microphones to reduce the influence of background noise and reverberation, but such microphones are inconvenient in daily life. If a computer could instead analyze communication scenes automatically using distant microphones and precisely extract each speaker's signal, close-talk microphones would no longer be needed. We could then control electrical appliances from a distance by voice or talk to autonomous robots. An automatic meeting-minutes generation system would need only a microphone array in the meeting room rather than a microphone for each participant. For human-to-human communication, reducing noise and reverberation improves speech intelligibility, which benefits remote-conference and mobile communication systems.
Techniques for understanding audio scenes are needed to enable more convenient speech communication. We are developing a speaker-indexing method that estimates “who spoke when” in a meeting with multiple participants. It combines a noise-robust voice activity detection (VAD) technique, which estimates when each utterance was spoken, with a direction-of-arrival (DOA) estimation technique, which estimates from which direction it was spoken.
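As a rough illustration of these two building blocks (not the actual system described here), speaker indexing can be prototyped from an energy-based VAD that flags frames containing speech, and a GCC-PHAT time-difference-of-arrival (TDOA) estimate between a microphone pair that indicates the direction of the speaker. All function names and thresholds below are hypothetical; a real noise-robust VAD would use a statistical noise model rather than a fixed threshold.

```python
import numpy as np

def energy_vad(frames, threshold):
    """Flag a frame as speech when its short-time energy exceeds a threshold.

    This is the simplest possible baseline; robust VADs model the noise
    statistically instead of using a fixed threshold.
    """
    return np.array([float(np.sum(f ** 2)) > threshold for f in frames])

def gcc_phat_tdoa(x, y, fs):
    """Estimate the time difference of arrival (seconds) of y relative to x.

    A positive result means the signal reaches microphone y later than x.
    The phase transform (PHAT) weighting sharpens the correlation peak.
    """
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12                   # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

def tdoa_to_doa_deg(tdoa, mic_distance, c=343.0):
    """Convert a TDOA to a far-field arrival angle (degrees) for one mic pair."""
    return float(np.degrees(np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0))))
```

In a meeting setting, running the VAD per frame and the DOA estimate on the speech-active frames yields exactly the “when” and “from which direction” labels that speaker indexing needs.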
Speech enhancement, including noise reduction and dereverberation, is a key technology for accurate automatic speech recognition (ASR) in real-world environments. Although many enhancement methods have been proposed, most fail to improve ASR performance effectively because of a mismatch between the enhanced speech and the speech model inside the ASR system. To resolve this mismatch, we investigate enhancement algorithms that guide their outputs according to a statistical speech model.
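The general idea of constraining the enhanced output can be illustrated, in highly simplified form, by a spectral Wiener gain: each frequency bin of the noisy power spectrum is attenuated according to an estimated speech-to-noise ratio, with a gain floor that keeps the output from collapsing to zero. This is a generic textbook sketch, not the model-guided algorithm described above; all names and parameter values are hypothetical.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, gain_floor=0.1):
    """Per-frequency Wiener gain from noisy and estimated noise power spectra.

    snr_prior is a crude a-priori SNR estimate (power subtraction, clipped
    at zero); the gain floor limits musical-noise artifacts.
    """
    snr_prior = np.maximum(noisy_power / (noise_power + 1e-12) - 1.0, 0.0)
    gain = snr_prior / (snr_prior + 1.0)
    return np.maximum(gain, gain_floor)

def enhance_frame(noisy_spectrum, noise_power, gain_floor=0.1):
    """Apply the gain to one complex STFT frame, keeping the noisy phase."""
    gain = wiener_gain(np.abs(noisy_spectrum) ** 2, noise_power, gain_floor)
    return gain * noisy_spectrum
```

A model-guided enhancer would go further, choosing the gain so that the output also scores well under the recognizer's statistical speech model rather than minimizing signal error alone.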
What we hear in daily life is a mixture of sounds from different sources. This research aims to decompose such complicated sound mixtures into a set of individual source factors and propagation factors, the latter characterizing the direction and reverberation of each source. The technology serves as a unifying framework for different types of audio signal processing, such as acoustic noise reduction, dereverberation, and sound source localization, and it has been used in our meeting analysis system.
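A minimal way to picture “source factors and propagation factors” is the convolutive model: each microphone observes the source filtered by its propagation path, here reduced to a pure integer-sample delay. Given the delays, delay-and-sum averaging realigns and reinforces the target source while sounds from other directions average out. This is an illustrative sketch under that simplifying assumption, not the decomposition method referred to above.

```python
import numpy as np

def propagate(source, delay):
    """Model the propagation factor as a pure integer-sample delay."""
    return np.concatenate([np.zeros(delay), source])[: len(source)]

def delay_and_sum(mic_signals, delays):
    """Undo each microphone's known delay, then average the aligned signals.

    Signals arriving with these exact delays add coherently; signals from
    other directions (other delay patterns) are attenuated by the averaging.
    """
    out = np.zeros_like(mic_signals[0])
    for sig, d in zip(mic_signals, delays):
        aligned = np.concatenate([sig[d:], np.zeros(d)])  # advance by d samples
        out += aligned
    return out / len(mic_signals)
```

Real rooms replace the pure delay with a full impulse response (direction plus reverberation), which is why the decomposition above must estimate both source and propagation factors jointly.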