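Speech Information Processing
Technology Category Overview
Speech information processing is a technology that handles speech as signal data on a computer and performs information processing such as analysis, recognition, and synthesis. It is one of the fields that has advanced dramatically in recent years with the introduction of deep learning. NTT has been at the forefront of research and development in this field for more than 50 years.
At the Human Informatics Laboratories, we focus our research and development on speech recognition and speech synthesis within speech information processing. In speech recognition, we aim to recognize and understand others by recognizing what is said and estimating the speaker's internal state from speech. In speech synthesis, we aim to enrich how we reach out to and interact with others by generating and converting, with high accuracy and from limited sources (training data, utterance text, and so on), speech that carries not only the spoken content but also diverse information such as emotion and speaking style.
Through these efforts, we aim to realize technologies that improve the quality of human thinking and communication.
Research Topics
- Speech recognition technology
- Speech synthesis and conversion technology
Peer-Reviewed Publications
2023
Journal papers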
- Hiroshi Sato, Yusuke Shinohara, Atsunori Ogawa, "Multi-modal modeling for device-directed speech detection using acoustic and linguistic cues", Acoustical Science and Technology, Acoustic Letters, vol.44, no.1, pp.40-43, 2023.
- Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, "Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection", IEEE Access, 2023 (to appear).
International conferences
- Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, "Scheduled Sampling for Neural Transducer-based ASR", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (to appear).
- Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, "AN EXPLORATION OF LANGUAGE DEPENDENCY FOR JAPANESE SELF-SUPERVISED SPEECH REPRESENTATION MODELS", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (to appear).
- Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Marc Delcroix, Atsunori Ogawa, Ryo Masumura, "LEVERAGING LARGE TEXT CORPORA FOR END-TO-END SPEECH SUMMARIZATION", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (to appear).
- Hiroki Kanagawa, and Yusuke Ijima, ``ENHANCEMENT OF TEXT-PREDICTING STYLE TOKEN WITH GENERATIVE ADVERSARIAL NETWORK FOR EXPRESSIVE SPEECH SYNTHESIS,'' Proc. ICASSP, 2023 (accepted).
- Hiroki Kanagawa, and Yusuke Ijima, ``SIMD-SIZE AWARE WEIGHT REGULARIZATION FOR FAST NEURAL VOCODING ON CPU,'' Proc. 2022 IEEE Spoken Language Technology Workshop (SLT 2022), Jan. 2023.
2022
Journal papers
- Mizuki Nagano, Yusuke Ijima, and Sadao Hiroya, ``Perceived Emotional States Mediate Willingness to Buy from Advertising Speech,'' Frontiers in Psychology, Dec. 2022. https://doi.org/10.3389/fpsyg.2022.1014921
International conferences
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya, "Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition", In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6287-6291, 2022.
- Takafumi Moriya, Takanori Ashihara, Atsushi Ando, Hiroshi Sato, Tomohiro Tanaka, Kohei Matsuura, Ryo Masumura, Marc Delcroix, Takahiro Shinozaki, "Hybrid RNN-T/Attention-based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.8282-8286, 2022.
- Atsushi Ando, Yumiko Murata, Ryo Masumura, Satoshi Suzuki, Naoki Makishima, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, "Customer Satisfaction Estimation using Unsupervised Representation Learning with Multi-Format Prediction Loss", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.8497-8501, 2022.
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka and Ryo Masumura, "Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.996-1000, 2022.
- Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, "Streaming Target-Speaker ASR with Neural Transducer", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2673-2677, 2022.
- Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, "Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.411-415, 2022.
- Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, "On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis", In Proc. IEEE Spoken Language Technology Workshop (SLT), 2022.
- Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Taichi Asami, "ON-DEVICE STREAMING TARGET-SPEAKER ASR WITH NEURAL TRANSDUCER", In Proc. IEEE Spoken Language Technology Workshop (SLT), 2022.
- Kenichi Fujita, Yusuke Ijima, and Hiroaki Sugiyama, ``Direct speech-reply generation from text-dialogue context,'' Proc. APSIPA Annual Summit and Conference 2022, Nov. 2022.
- Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis,'' Proc. INTERSPEECH 2022, pp. 4551--4555, Sept. 2022.
- Hiroki Kanagawa, Yusuke Ijima, and Hiroyuki Toda, ``Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU,'' Proc. INTERSPEECH 2022, pp. 1626--1630, Sept. 2022.
- Hiroki Kanagawa, and Yusuke Ijima, ``Multi-Sample Subband Wavernn Via Multivariate Gaussian,'' Proc. ICASSP, pp. 8427--8431, May 2022.
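Awards
- Speech Research Committee Research Encouragement Award (FY2022): Takafumi Moriya, "A Study of a Streaming End-to-End Speech Recognition Model Using a Hybrid RNN-T/Attention Architecture and Internal Language Model Integration"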
2021
Journal papers
- Atsushi Ando, Takeshi Mori, Satoshi Kobashikawa, Tomoki Toda, "Speech emotion recognition based on listener-dependent emotion perception models", APSIPA Transactions on Signal and Information Processing, Vol.10, No.1, 2021.
- Yuki Saito, Taiki Nakamura, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, ``Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification,'' Acoustical Science and Technology, Vol. 42, No. 1, pp. 1-11, Jan. 2021.
- Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo, and Yusuke Ijima, ``Model architectures to extrapolate emotional expressions in DNN-based text-to-speech,'' Speech Communication, Elsevier, Vol. 126, pp. 35-43, Jan. 2021.
International conferences
- Atsushi Ando, Ryo Masumura, Hiroshi Sato, Takafumi Moriya, Takanori Ashihara, Yusuke Ijima, Tomoki Toda, "Speech Emotion Recognition based on Listener Adaptive Models", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.6274-6278, 2021.
- Takafumi Moriya, Takanori Ashihara, Tomohiro Tanaka, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Yusuke Ijima, Ryo Masumura, Yusuke Shinohara, "SIMPLEFLAT: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.5664-5668, 2021.
- Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, "Multimodal Attention Fusion for Target Speaker Extraction", In Proc. IEEE Spoken Language Technology Workshop (SLT), pp. 778-784, 2021.
- Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo, "Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1149-1153, 2021.
- Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix and Taichi Asami, "Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1787-1791, 2021.
- Takanori Ashihara, Takafumi Moriya, Makio Kashino, "Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1757-1761, 2021.
- Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, Hosana Kamiyama, and Yusuke Ijima, ``Robust Speech-Age Estimation Using Local Maximum Mean Discrepancy Under Mismatched Recording Conditions,'' Proc. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 114--121, Dec. 2021.
- Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings,'' Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), pp. 211--215, Sept. 2021.
- Kenichi Fujita, Atsushi Ando, and Yusuke Ijima, ``Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis,'' Proc. INTERSPEECH 2021, pp. 3141-3145, Sept. 2021.
- Naoto Kakegawa, Sunao Hara, Masanobu Abe, and Yusuke Ijima, ``Phonetic and prosodic information estimation from texts for genuine Japanese end-to-end text-to-speech,'' Proc. INTERSPEECH 2021, pp. 3606--3610, Sept. 2021.
- Mizuki Nagano, Yusuke Ijima, and Sadao Hiroya, ``Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech,'' Proc. INTERSPEECH 2021, pp. 2486--2490, Sept. 2021.
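Awards
- Awaya Prize Young Researcher Award (50th, 2021 Spring Research Meeting): Takafumi Moriya, "A Study of Self-Knowledge Distillation for CTC-Transformer Speech Recognition"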
2020
Journal papers
- Hosana Kamiyama, Atsushi Ando, Ryo Masumura, Satoshi Kobashikawa, Yushi Aono, "Likability estimation for contact center agents by selecting annotators based on binomial distribution", Acoustical Science and Technology, Acoustic Letters, vol.41, no.6, pp.826-828, 2020.
- Atsushi Ando, Ryo Masumura, Hosana Kamiyama, Satoshi Kobashikawa, Yushi Aono, Tomoki Toda, "Customer Satisfaction Estimation in Contact Center Calls Based on a Hierarchical Multi-Task Model", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.28, pp.715-728, 2020.
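- Nobukatsu Hojo, Yusuke Ijima, and Hiroaki Sugiyama, ``Sentence-final tone label estimation using dialogue-act information for speech synthesis in spoken dialogue systems,'' Transactions of the Japanese Society for Artificial Intelligence, Vol. 35, No. 4, pp. A-J5_1-11, July 2020 (in Japanese).
- Nobukatsu Hojo, Yusuke Ijima, Hiroaki Sugiyama, Noboru Miyazaki, Takahito Kawanishi, and Kunio Kashino, ``DNN-based speech synthesis capable of expressing dialogue-act information and its evaluation with respect to illocutionary act naturalness,'' Transactions of the Japanese Society for Artificial Intelligence, Vol. 35, No. 2, pp. A-J81_1-17, Mar. 2020 (in Japanese).
International conferences
- Takafumi Moriya, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, "Distilling Attention Weights for CTC-based ASR Systems", In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.6889-6893, 2020.
- Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix, "Self-Distillation for Improving CTC-Transformer-based ASR Systems", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.546-550, 2020.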
- Yuki Kitagishi, Hosana Kamiyama, Atsushi Ando, Naohiro Tawara, Takeshi Mori, and Satoshi Kobashikawa, "Speaker age estimation using age-dependent insensitive loss", In Proc. APSIPA, pp. 319-324, Dec. 2020.
- Hiroki Kanagawa and Yusuke Ijima, ``Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition,'' Proc. Interspeech 2020, pp. 205-209, Oct. 2020.
- Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``Investigating Effective Additional Contextual Factors in DNN-based Spontaneous Speech Synthesis,'' Proc. Interspeech 2020, pp. 3201-3205, Oct. 2020.
- Nobukatsu Hojo, Yusuke Ijima, Hiroaki Sugiyama, Noboru Miyazaki, Takahito Kawanishi, and Kunio Kashino, ``DNN-based Speech Synthesis considering Dialogue-Act Information and its Evaluation with Respect to Illocutionary Act Naturalness,'' Proc. Speech Prosody 2020, Tokyo, Japan, May 2020.
- Takuya Ozuru, Yusuke Ijima, Daisuke Saito and Nobuaki Minematsu, ``Are you professional?: Analysis of prosodic features between a newscaster and amateur speakers through partial substitution by DNN-TTS,'' Proc. Speech Prosody 2020, Tokyo, Japan, May 2020.
- Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari, ``DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus,'' Proc. LREC 2020, pp. 6438-6443, May 2020.
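Awards
- Speech Research Committee Research Encouragement Award (FY2020): Hiroshi Sato, "A Study of Attention Mechanisms toward Real-World Operation of Audio-Visual Target Speaker Extraction"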
2019
International conferences
- Hosana Kamiyama, Atsushi Ando, Ryo Masumura, Satoshi Kobashikawa, Yushi Aono, "Likability Estimation of Call-center Agents by Suppressing Annotator Variability", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.911-916, 2019.
- Hosana Kamiyama, Atsushi Ando, Ryo Masumura, Satoshi Kobashikawa, Yushi Aono, "Urgent Voicemail Detection Focused on Long-term Temporal Variation", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.917-921, 2019.
- Ryo Masumura, Kiyoaki Matsui, Yuma Koizumi, Takaaki Fukutomi, Takanobu Oba, Yushi Aono, "Context-Aware Neural Voice Activity Detection Using Auxiliary Networks for Phoneme Recognition, Speech Enhancement and Acoustic Scene Classification", In Proc. European Signal Processing Conference (EUSIPCO), 2019.
- Ryo Masumura, Tomohiro Tanaka, Atsushi Ando, Hosana Kamiyama, Takanobu Oba, Satoshi Kobashikawa, Yushi Aono, "Improving Conversation-Context Language Models with Multiple Spoken Language Understanding Models", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.834-838, 2019.
- Ryo Masumura, Hiroshi Sato, Tomohiro Tanaka, Takafumi Moriya, Yusuke Ijima, Takanobu Oba, "End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.1606-1610, 2019.
- Ryo Masumura, Yusuke Ijima, Satoshi Kobashikawa, Takanobu Oba, Yushi Aono, "Can We Simulate Generative Process of Acoustic Modeling Data? Towards Data Restoration for Acoustic Modeling", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.655-661, 2019.
- Takafumi Moriya, Jian Wang, Tomohiro Tanaka, Ryo Masumura, Yusuke Shinohara, Yoshikazu Yamaguchi, Yushi Aono, "Joint Maximization Decoder with Neural Converters for Fully Neural Network-based Japanese Speech Recognition", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.4410-4414, 2019.
- Tomohiro Tanaka, Ryo Masumura, Takafumi Moriya, Takanobu Oba, Yushi Aono, "A Joint End-to-End and DNN-HMM Hybrid Automatic Speech Recognition System with Transferring Shared Knowledge", In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp.2210-2214, 2019.
- Tomohiro Tanaka, Ryo Masumura, Takafumi Moriya, Takanobu Oba, Yushi Aono, "Disfluency Detection Based on Speech-Aware Token-by-Token Sequence Labeling with BLSTM-CRFs and Attention Mechanisms", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.1009-1013, 2019.
- Hiroshi Sato, Takafumi Moriya, Yusuke Shinohara, Ryo Masumura, Takaaki Fukutomi, Kiyoaki Matsui, Takanori Ashihara, Yoshikazu Yamaguchi, Yushi Aono, "Revisiting Dynamic Adjustment of Language Model Scaling Factor for Automatic Speech Recognition", In Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp.186-191, 2019.
- Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, and Hiroshi Saruwatari, ``V2S attack: building DNN-based voice conversion from automatic speaker verification,'' Proc. 10th ISCA Speech Synthesis Workshop, pp. 161--165, Vienna, Austria, Sept. 2019.
- Hiroki Kanagawa and Yusuke Ijima, ``Multi-Speaker Modeling for DNN-based Speech Synthesis Incorporating Generative Adversarial Networks,'' Proc. 10th ISCA Speech Synthesis Workshop, pp. 40--44, Vienna, Austria, Sept. 2019.