Abstract
Recently, the detection and classification of various sounds have attracted much attention from researchers. We propose an
audio captioning system that can describe various non-speech audio signals in the form of natural language. Most
existing audio captioning systems have mainly focused on “what the individual sound is,” or classifying sounds to
find object labels or types. In contrast, the proposed system generates (1) an onomatopoeia, i.e., a verbal
simulation of a non-speech sound, and (2) a sentence describing the sound, given an audio signal as input. This
allows the description to convey richer information, such as what the sound sounds like and how its tone or volume
changes over time. Our approach also enables directly measuring the distance between a sentence and an audio
sample. The potential applications include sound effect search systems that can accept detailed sentence
queries, audio captioning systems for videos, and AI systems that can hear and represent sounds as humans do.
Kunio Kashino, Media Information Laboratory
Email:
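
The abstract notes that the approach enables directly measuring the distance between a sentence and an audio sample. The following is a minimal sketch of that idea, assuming a shared embedding space into which both a sentence and an audio clip are projected and compared by cosine distance. The encoders, the 128-dimensional space, and the clip names are hypothetical placeholders for illustration, not the authors' actual models.

```python
import numpy as np

def encode_sentence(sentence: str) -> np.ndarray:
    """Hypothetical text encoder: maps a sentence to a fixed-length vector."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(128)

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical audio encoder: maps a waveform into the same vector space."""
    rng = np.random.default_rng(int(waveform.sum() * 1000) % (2**32))
    return rng.standard_normal(128)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Smaller values indicate a closer match between sentence and audio."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example use: rank candidate audio clips against a detailed sentence query,
# as a sound-effect search system might do.
query = "a low rumble that grows louder and then fades away"
clips = {"clip_a": np.random.randn(16000), "clip_b": np.random.randn(16000)}
scores = {name: cosine_distance(encode_sentence(query), encode_audio(wav))
          for name, wav in clips.items()}
print(sorted(scores.items(), key=lambda kv: kv[1]))
```

With real encoders trained on paired audio and captions, the same distance could rank sound effects against detailed sentence queries, as mentioned among the potential applications above.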