Speech information processing
Technology
Speech information processing technology handles speech as signal data for analysis and processing, such as recognition and synthesis. In recent years, this technology has evolved dramatically with the introduction of deep learning. NTT has been at the forefront of research and development for over 50 years.
NTT Human Informatics Laboratories focus on research and development of "speech recognition" and "speech synthesis". Speech recognition technology not only recognizes the content of speech but also estimates the speaker's inner thoughts, with the aim of understanding others by recognizing those inner thoughts. Speech synthesis technology converts text into natural-sounding speech while reproducing the speaker's characteristics, emotions, and speaking style, using a small amount of speech data as clues.
Through research into speech information processing, NTT Human Informatics Laboratories aim to realize technology that improves the quality of human thinking and communication.
Research
- Speech Recognition (under construction)
- Speech Synthesis (under construction)
Publications
2023
Journal Papers
- Hiroshi Sato, Yusuke Shinohara, Atsunori Ogawa, "Multi-modal modeling for device-directed speech detection using acoustic and linguistic cues", Acoustical Science and Technology, Acoustic Letters, vol.44, no.1, pp.40-43, 2023.
- Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, "Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection", IEEE Access, 2023 (to appear).
Conference Papers
- Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, "Scheduled Sampling for Neural Transducer-based ASR", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (to appear).
- Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, "AN EXPLORATION OF LANGUAGE DEPENDENCY FOR JAPANESE SELF-SUPERVISED SPEECH REPRESENTATION MODELS", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (to appear).
- Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Marc Delcroix, Atsunori Ogawa, Ryo Masumura, "LEVERAGING LARGE TEXT CORPORA FOR END-TO-END SPEECH SUMMARIZATION", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (to appear).
- Hiroki Kanagawa and Yusuke Ijima, ``ENHANCEMENT OF TEXT-PREDICTING STYLE TOKEN WITH GENERATIVE ADVERSARIAL NETWORK FOR EXPRESSIVE SPEECH SYNTHESIS,'' Proc. ICASSP, 2023 (accepted).
- Hiroki Kanagawa and Yusuke Ijima, ``SIMD-SIZE AWARE WEIGHT REGULARIZATION FOR FAST NEURAL VOCODING ON CPU,'' Proc. 2022 IEEE Spoken Language Technology Workshop (SLT 2022), Jan. 2023.
2022
Journal Papers
- Mizuki Nagano, Yusuke Ijima, and Sadao Hiroya, ``Perceived Emotional States Mediate Willingness to Buy from Advertising Speech,'' Frontiers in Psychology, Dec. 2022. https://doi.org/10.3389/fpsyg.2022.1014921
Conference Papers
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya, "Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition", In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6287-6291, 2022.
- Takafumi Moriya, Takanori Ashihara, Atsushi Ando, Hiroshi Sato, Tomohiro Tanaka, Kohei Matsuura, Ryo Masumura, Marc Delcroix, Takahiro Shinozaki, "Hybrid RNN-T/Attention-based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.8282-8286, 2022.
- Atsushi Ando, Yumiko Murata, Ryo Masumura, Satoshi Suzuki, Naoki Makishima, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, "Customer Satisfaction Estimation using Unsupervised Representation Learning with Multi-Format Prediction Loss", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.8497-8501, 2022.
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka and Ryo Masumura, "Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.996-1000, 2022.
- Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, "Streaming Target-Speaker ASR with Neural Transducer", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2673-2677, 2022.
- Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, "Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.411-415, 2022.
- Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, "On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis", In Proc. IEEE Spoken Language Technology Workshop (SLT), 2022.
- Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Taichi Asami, "ON-DEVICE STREAMING TARGET-SPEAKER ASR WITH NEURAL TRANSDUCER", IEEE Spoken Language Technology Workshop (SLT), 2022.
- Kenichi Fujita, Yusuke Ijima, and Hiroaki Sugiyama, ``Direct speech-reply generation from text-dialogue context,'' Proc. APSIPA Annual Summit and Conference 2022, Nov. 2022.
- Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis,'' Proc. INTERSPEECH 2022, pp. 4551--4555, Sept. 2022.
- Hiroki Kanagawa, Yusuke Ijima, and Hiroyuki Toda, ``Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU,'' Proc. INTERSPEECH 2022, pp. 1626--1630, Sept. 2022.
- Hiroki Kanagawa and Yusuke Ijima, ``Multi-Sample Subband WaveRNN via Multivariate Gaussian,'' Proc. ICASSP, pp. 8427--8431, May 2022.
2021
Journal Papers
- Atsushi Ando, Takeshi Mori, Satoshi Kobashikawa, Tomoki Toda, "Speech emotion recognition based on listener-dependent emotion perception models", APSIPA Transactions on Signal and Information Processing, Vol.10, No.1, 2021.
- Yuki Saito, Taiki Nakamura, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, ``Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification,'' Acoustical Science and Technology, Vol. 42, No. 1, pp. 1-11, Jan. 2021.
- Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo, and Yusuke Ijima, ``Model architectures to extrapolate emotional expressions in DNN-based text-to-speech,'' Speech Communication, Elsevier, Vol. 126, pp. 35-43, Jan. 2021.
Conference Papers
- Atsushi Ando, Ryo Masumura, Hiroshi Sato, Takafumi Moriya, Takanori Ashihara, Yusuke Ijima, Tomoki Toda, "Speech Emotion Recognition based on Listener Adaptive Models", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.6274-6278, 2021.
- Takafumi Moriya, Takanori Ashihara, Tomohiro Tanaka, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Yusuke Ijima, Ryo Masumura, Yusuke Shinohara, "SIMPLEFLAT: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.5664-5668, 2021.
- Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, "Multimodal Attention Fusion for Target Speaker Extraction", in Proc. IEEE Spoken Language Technology Workshop (SLT), pp. 778-784, 2021.
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo, "Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1149-1153, 2021.
- Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix and Taichi Asami, "Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1787-1791, 2021.
- Takanori Ashihara, Takafumi Moriya, Makio Kashino, "Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1757-1761, 2021.
- Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, Hosana Kamiyama, and Yusuke Ijima, ``Robust Speech-Age Estimation Using Local Maximum Mean Discrepancy Under Mismatched Recording Conditions,'' Proc. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 114--121, Dec. 2021.
- Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings,'' Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), pp. 211--215, Sept. 2021.
- Kenichi Fujita, Atsushi Ando, and Yusuke Ijima, ``Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis,'' Proc. INTERSPEECH 2021, pp. 3141-3145, Sept. 2021.
- Naoto Kakegawa, Sunao Hara, Masanobu Abe, and Yusuke Ijima, ``Phonetic and prosodic information estimation from texts for genuine Japanese end-to-end text-to-speech,'' Proc. INTERSPEECH 2021, pp. 3606--3610, Sept. 2021.
- Mizuki Nagano, Yusuke Ijima, and Sadao Hiroya, ``Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech,'' Proc. INTERSPEECH 2021, pp. 2486--2490, Sept. 2021.
2020
Journal Papers
- Hosana Kamiyama, Atsushi Ando, Ryo Masumura, Satoshi Kobashikawa, Yushi Aono, "Likability estimation for contact center agents by selecting annotators based on binomial distribution", Acoustical Science and Technology, Acoustic Letters, vol.41, no.6, pp.826-828, 2020.
- Atsushi Ando, Ryo Masumura, Hosana Kamiyama, Satoshi Kobashikawa, Yushi Aono, Tomoki Toda, "Customer Satisfaction Estimation in Contact Center Calls Based on a Hierarchical Multi-Task Model", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.28, pp.715-728, 2020.
Conference Papers
- Takafumi Moriya, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, "Distilling Attention Weights for CTC-based ASR Systems", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.6889-6893, 2020.
- Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix, "Self-Distillation for Improving CTC-Transformer-based ASR Systems", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.546-550, 2020.
- Yuki Kitagishi, Hosana Kamiyama, Atsushi Ando, Naohiro Tawara, Takeshi Mori, and Satoshi Kobashikawa, "Speaker age estimation using age-dependent insensitive loss", In Proc. APSIPA, pp. 319-324, Dec. 2020.
- Hiroki Kanagawa and Yusuke Ijima, ``Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition,'' Proc. Interspeech 2020, pp. 205-209, Oct. 2020.
- Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``Investigating Effective Additional Contextual Factors in DNN-based Spontaneous Speech Synthesis,'' Proc. Interspeech 2020, pp. 3201-3205, Oct. 2020.
- Nobukatsu Hojo, Yusuke Ijima, Hiroaki Sugiyama, Noboru Miyazaki, Takahito Kawanishi, and Kunio Kashino, ``DNN-based Speech Synthesis considering Dialogue-Act Information and its Evaluation with Respect to Illocutionary Act Naturalness,'' Proc. Speech Prosody 2020, Tokyo, Japan, May 2020.
- Takuya Ozuru, Yusuke Ijima, Daisuke Saito and Nobuaki Minematsu, ``Are you professional?: Analysis of prosodic features between a newscaster and amateur speakers through partial substitution by DNN-TTS,'' Proc. Speech Prosody 2020, Tokyo, Japan, May 2020.
- Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus,'' Proc. LREC 2020, pp. 6438-6443, May 2020.
2019
Conference Papers
- Hosana Kamiyama, Atsushi Ando, Ryo Masumura, Satoshi Kobashikawa, Yushi Aono, "Likability Estimation of Call-center Agents by Suppressing Annotator Variability", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.911-916, 2019.
- Hosana Kamiyama, Atsushi Ando, Ryo Masumura, Satoshi Kobashikawa, Yushi Aono, "Urgent Voicemail Detection Focused on Long-term Temporal Variation", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.917-921, 2019.
- Ryo Masumura, Kiyoaki Matsui, Yuma Koizumi, Takaaki Fukutomi, Takanobu Oba, Yushi Aono, "Context-Aware Neural Voice Activity Detection Using Auxiliary Networks for Phoneme Recognition, Speech Enhancement and Acoustic Scene Classification", In Proc. European Signal Processing Conference (EUSIPCO), 2019.
- Ryo Masumura, Tomohiro Tanaka, Atsushi Ando, Hosana Kamiyama, Takanobu Oba, Satoshi Kobashikawa, Yushi Aono, "Improving Conversation-Context Language Models with Multiple Spoken Language Understanding Models", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.834-838, 2019.
- Ryo Masumura, Hiroshi Sato, Tomohiro Tanaka, Takafumi Moriya, Yusuke Ijima, Takanobu Oba, "End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1606-1610, 2019.
- Ryo Masumura, Yusuke Ijima, Satoshi Kobashikawa, Takanobu Oba, Yushi Aono, "Can We Simulate Generative Process of Acoustic Modeling Data? Towards Data Restoration for Acoustic Modeling", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.655-661, 2019.
- Takafumi Moriya, Jian Wang, Tomohiro Tanaka, Ryo Masumura, Yusuke Shinohara, Yoshikazu Yamaguchi, Yushi Aono, "Joint Maximization Decoder with Neural Converters for Fully Neural Network-based Japanese Speech Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.4410-4414, 2019.
- Tomohiro Tanaka, Ryo Masumura, Takafumi Moriya, Takanobu Oba, Yushi Aono, "A Joint End-to-End and DNN-HMM Hybrid Automatic Speech Recognition System with Transferring Shared Knowledge", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2210-2214, 2019.
- Tomohiro Tanaka, Ryo Masumura, Takafumi Moriya, Takanobu Oba, Yushi Aono, "Disfluency Detection Based on Speech-Aware Token-by-Token Sequence Labeling with BLSTM-CRFs and Attention Mechanisms", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.1009-1013, 2019.
- Hiroshi Sato, Takafumi Moriya, Yusuke Shinohara, Ryo Masumura, Takaaki Fukutomi, Kiyoaki Matsui, Takanori Ashihara, Yoshikazu Yamaguchi, Yushi Aono, "Revisiting Dynamic Adjustment of Language Model Scaling Factor for Automatic Speech Recognition", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.186-191, 2019.
- Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, and Hiroshi Saruwatari, ``V2S attack: building DNN-based voice conversion from automatic speaker verification,'' Proc. 10th ISCA Speech Synthesis Workshop, pp. 161--165, Vienna, Austria, Sept. 2019.
- Hiroki Kanagawa and Yusuke Ijima, ``Multi-Speaker Modeling for DNN-based Speech Synthesis Incorporating Generative Adversarial Networks,'' Proc. 10th ISCA Speech Synthesis Workshop, pp. 40--44, Vienna, Austria, Sept. 2019.