Human pose estimation with acoustic signals



We developed a method for estimating 3D human poses non-invasively from actively emitted acoustic signals, without using a visible-light camera. A speaker and microphone are placed on either side of a person, and their 3D pose is estimated from minute changes in the sound field that occur when the sound emitted from the speaker is interrupted by the person's body.


Human poses are commonly estimated from video captured with a camera. However, this is difficult in environments where visible light is insufficient, such as at night, and cameras also raise privacy concerns.
To avoid the privacy problem, radio waves have been used [Zhao+ CVPR2018], but that approach cannot be applied in environments where radio-wave emission is restricted, such as hospitals or airplanes.
The proposed method works even in environments where these conventional methods cannot, while avoiding privacy issues. Furthermore, as long as a speaker and microphone are available, the user does not need to wear any special device for pose estimation.

Proposed method

With the proposed method, a time-stretched pulse (TSP), a signal whose frequency changes with elapsed time, is emitted from the speaker, and the sound affected by the person's body is captured with an ambisonic microphone, which records spatial sound over 360°. Basic features are then extracted from the recorded sound, and the 3D pose is estimated with convolutional networks. We also introduce adversarial learning that suppresses cues identifying the measured person, enabling stable pose estimation.
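As a rough illustration of the excitation signal, here is a minimal NumPy sketch of a TSP-like swept sine. The linear sweep, sampling rate, and frequency range are illustrative assumptions, not the paper's actual TSP design:

```python
import numpy as np

def generate_tsp(fs=48000, duration=1.0, f0=20.0, f1=20000.0):
    """Linear swept sine whose instantaneous frequency rises from f0 to f1
    over `duration` seconds -- a TSP-like excitation signal.
    (Parameter values are illustrative, not the paper's settings.)"""
    t = np.arange(int(fs * duration)) / fs
    k = (f1 - f0) / duration                     # sweep rate, Hz per second
    phase = 2 * np.pi * (f0 * t + 0.5 * k * t**2)
    return np.sin(phase)

pulse = generate_tsp()  # one-second sweep at 48 kHz
```

Because the frequency rises monotonically over time, the delay at which each frequency component returns to the microphone encodes how the person's body disturbed the sound field.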

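One plausible choice of "basic features" from an ambisonic recording is the acoustic intensity vector computed from the first-order B-format channels (W, X, Y, Z). The sketch below is a hypothetical illustration in NumPy; the function names, STFT parameters, and normalization are assumptions, not the paper's exact feature design:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Simple framed FFT with a Hann window -> (frames, bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def intensity_features(w, x, y, z, n_fft=512, hop=256):
    """Approximate active acoustic intensity vectors from first-order
    ambisonic (B-format) channels -- a hypothetical feature extractor.
    Returns an array of shape (frames, bins, 3)."""
    W, X, Y, Z = (stft(c, n_fft, hop) for c in (w, x, y, z))
    # Active intensity per time-frequency bin: Re{conj(W) * [X, Y, Z]}
    I = np.stack([np.real(np.conj(W) * C) for C in (X, Y, Z)], axis=-1)
    norm = np.abs(W) ** 2 + 1e-12                # normalize by W energy
    return I / norm[..., None]
```

Feature maps of this kind (time x frequency x direction) are a natural input to convolutional networks, which can then regress 3D joint coordinates.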
Future work

We aim to develop a technology that can recognize and represent all events in the real world by simultaneously optimizing measurement, signal processing, modeling, and understanding.


  1. Shibata, Kawashima, Isogawa, Irie, Kimura, Aoki, “Listening human behavior: 3D human pose estimation with acoustic signals,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  2. Project page by Keio University


Akisato Kimura
Recognition Research Group, Media Information Laboratory, NTT Communication Science Laboratories
