Abstract
We propose a method to accurately estimate "who spoke when" based on speakers' voice characteristics. It
works even when multiple speakers' speech signals overlap, and accurately counts the number of speakers
in such cases. Conventional methods with similar functionality work only when the observed signal
satisfies certain a priori (often unrealistic) assumptions (e.g., the number of speakers is known in advance, or speakers never
change their locations). However, these assumptions often cannot be satisfied in realistic scenarios, which leads
to performance degradation. In contrast, the proposed method, which is based purely on deep learning,
can in principle learn and handle any realistic conversation situation. It is expected to serve as a fundamental
technology for automatic conversation analysis systems, and will contribute to the realization of automatic meeting
minutes generation systems and communication robots.
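As an illustration only (the abstract does not specify the model's internals), overlap-aware "who spoke when" estimation is commonly framed as per-frame, multi-label speaker-activity prediction: a neural network outputs, for each time frame, a posterior probability that each candidate speaker is active, and speakers may be active simultaneously. The hypothetical sketch below shows how such framewise posteriors could be thresholded to recover overlapping speaker activities and a speaker count; the function name and threshold are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def count_active_speakers(posteriors, threshold=0.5):
    """Given per-frame speaker-activity posteriors (frames x speakers),
    return a binary activity matrix (who speaks in which frame) and the
    number of speakers active in at least one frame."""
    activity = posteriors >= threshold              # frames x speakers, bool
    n_speakers = int(np.any(activity, axis=0).sum())  # speakers seen at all
    return activity, n_speakers

# Toy posteriors for 4 frames and 3 candidate speakers; speakers 0 and 1
# overlap in frame 1, while speaker 2 never becomes active.
post = np.array([[0.9, 0.1, 0.2],
                 [0.8, 0.7, 0.3],
                 [0.2, 0.9, 0.1],
                 [0.1, 0.6, 0.4]])
activity, n = count_active_speakers(post)
print(n)  # 2 speakers detected despite the overlap in frame 1
```

Thresholding per speaker (rather than picking one speaker per frame) is what lets this formulation represent overlapped speech, which single-label classification cannot.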
Keisuke Kinoshita, Processing Research Group, Media Information Laboratory
Email: