Abstract
In order for AI to visually perceive the world around it and to use language to communicate, it needs a dictionary
that associates the visual objects in the world with the spoken words that refer to them. We explore neural
network models that learn semantic correspondences between objects and words, given images and
multilingual spoken audio captions describing those images. We show that training a trilingual model
simultaneously on English, Hindi, and newly recorded Japanese audio caption data improves retrieval
performance over monolingual models. Further, we demonstrate that the trilingual model implicitly learns
meaningful word-level translations grounded in images. We aim for a future in which AI discovers concepts
autonomously by finding audio-visual co-occurrences in media data that already exists in the world, such as
TV broadcasts. We also consider applications to large-scale archive retrieval and automatic
annotation involving interactions between different sensory modalities such as vision, audio, and language.
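To make the idea of learning semantic correspondences concrete, the following is a minimal sketch, not the model described in this work: hypothetical image and audio encoders map images and spoken captions (in any of the three languages) into a shared embedding space, trained with a margin ranking loss over mismatched pairs within a batch. The encoder architectures, feature dimensions, and margin value are illustrative assumptions.

# Minimal sketch of cross-modal embedding learning (illustrative only).
# Images and spoken captions are projected into a shared space; a margin
# ranking loss pulls matching image/caption pairs together and pushes
# mismatched pairs apart. All sizes and architectures are assumptions.
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Projects precomputed image features into the shared embedding space."""
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, x):
        return nn.functional.normalize(self.proj(x), dim=-1)


class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram of a spoken caption (any language)."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, embed_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, spec):                      # spec: (batch, n_mels, time)
        h = self.conv(spec).mean(dim=-1)          # temporal average pooling
        return nn.functional.normalize(h, dim=-1)


def ranking_loss(img_emb, aud_emb, margin=0.2):
    """Margin ranking loss over all impostor pairs in the batch."""
    sim = img_emb @ aud_emb.t()                   # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)                 # matching image/caption pairs
    cost_aud = (margin + sim - pos).clamp(min=0)      # wrong caption for an image
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # wrong image for a caption
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_aud = cost_aud.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return (cost_aud + cost_img).mean()


if __name__ == "__main__":
    # Toy batch: 8 images paired with 8 spoken captions (language-agnostic).
    img_enc, aud_enc = ImageEncoder(), AudioEncoder()
    images = torch.randn(8, 2048)                 # precomputed image features
    specs = torch.randn(8, 40, 300)               # log-mel spectrograms
    loss = ranking_loss(img_enc(images), aud_enc(specs))
    loss.backward()
    print(float(loss))

In a trilingual setting of this kind, captions from all languages would share the same audio encoder and embedding space, which is one way word-level correspondences across languages could emerge implicitly through their shared visual grounding.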
Yasunori Ohishi, Media Recognition Group, Media Information Laboratory
Email: