Video and audio media processing technology

Generating 3D Acoustic Spaces Using 360-Degree Video

More and more users are experiencing highly immersive visual spaces such as 360-degree video. Until now, presenting sound that responds to changes in the user's viewing direction has required capturing spatial sound with a microphone array of four or more channels, and preparing this special equipment has been difficult.

We are researching a technique to reconstruct spatial acoustics (a 4-channel acoustic signal in a format called Ambisonics) from a monaural acoustic signal and a 360-degree video image. We are also researching techniques to improve the accuracy of sound source separation.
Acoustic reproduction techniques generally require loudspeakers to be placed densely on a boundary surface such as a sphere, which makes it possible to reproduce the spatial distribution of sound pressure inside the enclosed area. Focusing on this property, we constructed a multilayer neural network that takes 360-degree video and monaural audio as inputs and outputs Ambisonic coefficients. For this, the location of each sound source has to be determined separately. The 3D acoustic space is then reproduced by adjusting the output of N loudspeakers in the area.
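
To make the 4-channel Ambisonics signal more concrete, the following minimal sketch encodes a monaural signal arriving from a known direction into first-order Ambisonics and then rotates the sound field when the viewer turns their head. It is not the neural network described above, which estimates the coefficients from video and audio. The channel convention (ACN order W, Y, Z, X with SN3D normalization), the function names `encode_foa` and `rotate_yaw`, and the sample values are assumptions made for illustration.

```python
import math


def encode_foa(mono, azimuth, elevation):
    """Encode mono samples into 4 first-order Ambisonics channels.

    Assumed convention: ACN channel order (W, Y, Z, X), SN3D normalization,
    angles in radians, azimuth measured counterclockwise from the front.
    """
    gains = (
        1.0,                                      # W: omnidirectional
        math.sin(azimuth) * math.cos(elevation),  # Y: left-right
        math.sin(elevation),                      # Z: up-down
        math.cos(azimuth) * math.cos(elevation),  # X: front-back
    )
    return [[g * s for s in mono] for g in gains]


def rotate_yaw(foa, angle):
    """Rotate the sound field by `angle` radians about the vertical axis."""
    w, y, z, x = foa
    c, s = math.cos(angle), math.sin(angle)
    x_rot = [c * xi - s * yi for xi, yi in zip(x, y)]
    y_rot = [s * xi + c * yi for xi, yi in zip(x, y)]
    return [w, y_rot, z, x_rot]  # W and Z are unaffected by a yaw rotation


# A source 90 degrees to the listener's left.
mono = [1.0, 0.5, -0.25]
foa = encode_foa(mono, azimuth=math.radians(90), elevation=0.0)

# The viewer turns 90 degrees to the left, so the field rotates the
# opposite way relative to the head and the source ends up in front.
head_yaw = math.radians(90)
foa_head = rotate_yaw(foa, -head_yaw)
```

After the rotation, the source energy moves from the Y (left-right) channel to the X (front-back) channel, which is how an Ambisonics signal can follow the viewing direction without re-recording.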

Generating 3D Acoustic Spaces

Left: Current image, Center: Correct image, Right: Proposed technology

So far, we have validated our technology using a small amount of data, and we intend to conduct quantitative evaluations using large data sets in the future. We will introduce this 3D acoustic space into the ultra-high-reality metaverse.

Harmonious Reproduction of Real Venues and Online Audiences

More and more users are watching live-streamed events at home, and the need for online entertainment is increasing. When audiences cheer online, however, their responses are typically out of step with one another, so real venues and online audiences have never been in harmony.

We developed a technique to retrieve cheering sounds using cross-modal sound retrieval technology. We estimate the likelihood of excitement from video of the audience cheering (waving penlights). As training data, we prepare pairs of audience cheering video and the corresponding cheering sound, and we use this data to build a model that estimates the sound from the video.
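
As a rough illustration of cross-modal retrieval, the sketch below assumes that a trained model has already mapped the audience video and a set of candidate cheering sounds into a shared embedding space, and it simply picks the sound whose embedding is closest to the video embedding by cosine similarity. The embedding values, the clip names, and the function `retrieve_cheer` are hypothetical and do not come from the system described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_cheer(video_embedding, audio_bank):
    """Return (name, score) of the audio clip closest to the video embedding."""
    scored = (
        (name, cosine_similarity(video_embedding, embedding))
        for name, embedding in audio_bank.items()
    )
    return max(scored, key=lambda item: item[1])


# Hypothetical audio embeddings, as a trained audio encoder might produce.
audio_bank = {
    "quiet_applause": [0.9, 0.1, 0.0],
    "rhythmic_clapping": [0.2, 0.9, 0.1],
    "loud_cheering": [0.1, 0.3, 0.9],
}

# Hypothetical embedding of a video clip showing vigorous penlight waving.
video_embedding = [0.0, 0.2, 1.0]

best_name, best_score = retrieve_cheer(video_embedding, audio_bank)
```

In this toy example the video embedding lies closest to the `loud_cheering` entry, so that clip is retrieved. The similarity score could also serve as a simple stand-in for the excitement likelihood, although the actual estimation method is not specified here.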

We extract features related to cooperative behavior from the videos of spectators clapping their hands and waving penlights. These features allow us to detect misalignments between multiple remote audience videos. We can compensate for the misalignment and generate synchronized video.
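
One common way to detect and compensate for this kind of misalignment is cross-correlation, sketched below under the assumption that each remote video has already been reduced to a one-dimensional motion feature per frame (for example, the strength of clapping or penlight movement). The feature values and the function names `estimate_lag` and `compensate` are illustrative; the features and alignment method actually used are not detailed here.

```python
def estimate_lag(reference, delayed, max_lag):
    """Return the lag in frames at which `delayed` best matches `reference`.

    A positive result means `delayed` runs behind `reference`.
    """
    def mean_center(series):
        mean = sum(series) / len(series)
        return [value - mean for value in series]

    ref, dly = mean_center(reference), mean_center(delayed)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate the overlapping part of the two series at this lag.
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(dly):
                score += r * dly[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def compensate(series, lag):
    """Shift `series` earlier by `lag` frames, padding with the edge value."""
    if lag > 0:
        return series[lag:] + [series[-1]] * lag
    if lag < 0:
        return [series[0]] * (-lag) + series[:lag]
    return list(series)


# Hypothetical per-frame motion features from two remote audience videos.
reference = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]  # same motion, two frames late

lag = estimate_lag(reference, delayed, max_lag=4)
aligned = compensate(delayed, lag)
```

The same estimated lag can then be applied to the video frames themselves so that the remote audiences appear to move in time with each other.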

At the 34th Mynavi Tokyo Girls Collection 2022 SPRING/SUMMER (March 21, 2022), the audience could not cheer because of the COVID-19 pandemic. We generated pseudo cheers according to the excitement of the online and real audiences. In this way, we were able to harmonize the two and realize a new experience that brought the real and online audiences together.

A new experience that brought the real and online audiences together