Exhibition Program

Science of Media Information


Neural audio captioning

- Generating text describing non-speech audio -


Recently, the detection and classification of various sounds have attracted many researchers' attention. We propose an audio captioning system that describes various non-speech audio signals in natural language. Most existing audio captioning systems have mainly focused on “what the individual sound is,” that is, on classifying sounds to find object labels or types. In contrast, given an audio signal as input, the proposed system generates (1) an onomatopoeia, i.e., a verbal simulation of a non-speech sound, and (2) a sentence describing the sound. This allows the description to include more information, such as how the sound sounds and how its tone or volume changes over time. Our approach also makes it possible to directly measure the distance between a sentence and an audio sample. Potential applications include sound effect search systems that accept detailed sentence queries, audio captioning systems for videos, and AI systems that can hear and represent sounds as humans do.
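
To illustrate how a sentence-to-audio distance could drive a sound effect search, the following is a minimal sketch, not the system's actual method. It assumes that a sentence query and each audio clip have already been mapped to vectors in a shared space by some trained encoders (not shown), and it uses cosine distance as one possible distance measure. The vectors, clip names, and the functions cosine_distance and rank_sounds are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_sounds(query_vec, library):
    """Sort (name, vector) pairs by distance to the query, nearest first."""
    return sorted(library, key=lambda item: cosine_distance(query_vec, item[1]))


# Hypothetical, hand-made embeddings. A real system would obtain these
# from trained sentence and audio encoders, which are assumed here.
query = [0.9, 0.1, 0.0]  # stands in for an encoded sentence query
library = [
    ("door_slam", [0.1, 0.9, 0.2]),
    ("glass_clink", [0.8, 0.2, 0.1]),
    ("rain", [0.0, 0.1, 0.9]),
]

ranked = rank_sounds(query, library)
nearest = ranked[0][0]  # name of the clip closest to the query
```

In this toy setup, the clip whose vector points in nearly the same direction as the query is ranked first. Any other distance defined between the two representations could be substituted without changing the search loop.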


  • [1] Shota Ikawa, Kunio Kashino, “Generating sound words from audio signals of acoustic events with sequence-to-sequence model,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), April 2018.
  • [2] Shota Ikawa, Kunio Kashino, “Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds,” in Proc. Detection and Classification of Acoustic Scenes and Events (DCASE 2018), November 2018.




Kunio Kashino, Media Information Laboratory