Given a photo, humans can estimate depth and identify which parts are out of focus on the basis of their experience and knowledge. This is not easy for typical computers, however, because they lack such experience and knowledge. To overcome this limitation, we developed a deep generative model that uses a camera-aperture rendering mechanism, making it possible to learn unknown depth and bokeh effects from only a set of ordinary photos, such as images on the web.
Conventionally, training a 3D estimator that can estimate depth and bokeh effects from a 2D image has required collecting 3D information with special equipment such as depth or stereo cameras. In contrast, training our deep generative model requires only a set of ordinary photos, such as images on the web; neither depth nor bokeh-effect information is needed. To learn 3D representations under this challenging condition, our model incorporates a camera-aperture rendering mechanism, making it possible to learn not only the distribution of the images given as training data but also the distribution of the 3D information hidden in those images. As a result, the model can generate not only an image but also the corresponding depth while controlling bokeh effects.
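To make the idea of aperture rendering concrete, the following is a minimal, simplified sketch of how depth-dependent blur can be rendered from an image and a depth map using the thin-lens circle-of-confusion model. All function and parameter names here are our own illustrative choices, and this simple non-differentiable version is not the model's actual rendering mechanism.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_bokeh(image, depth, aperture, focus_dist, n_levels=8, max_sigma=4.0):
    """Render depth-dependent blur (a simplified, hypothetical sketch).

    image:      (H, W) grayscale image
    depth:      (H, W) per-pixel scene depth
    aperture:   relative aperture size (0 = pinhole, sharp everywhere)
    focus_dist: depth that stays in focus
    """
    # Circle of confusion: grows with aperture size and with the
    # distance of each pixel's depth from the focal plane.
    coc = aperture * np.abs(depth - focus_dist) / np.maximum(depth, 1e-6)
    sigma = np.clip(coc, 0.0, max_sigma)

    # Cheap layered approximation: precompute a small stack of blurred
    # images and pick, per pixel, the level matching its blur amount.
    sigmas = np.linspace(0.0, max_sigma, n_levels)
    stack = np.stack([image if s == 0 else gaussian_filter(image, s)
                      for s in sigmas])
    idx = np.clip(np.rint(sigma / max_sigma * (n_levels - 1)).astype(int),
                  0, n_levels - 1)
    return np.take_along_axis(stack, idx[None], axis=0)[0]
```

Varying `aperture` changes the overall bokeh strength, and varying `focus_dist` changes which depth stays sharp, mirroring the two controls described above. In the actual model, the renderer must additionally be differentiable so that the depth estimate can be learned end-to-end from photos alone.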
Through our experiments, we confirmed that our model can flexibly control bokeh while estimating depth: it can adjust the bokeh strength by changing the aperture size and shift which object is in focus by varying the focus distance.
Since we live in a 3D world, developing computers that can understand that world is important for making them highly compatible with humans. We expect that our model will reduce the cost of collecting the data necessary for developing such computers and open up a new field of 3D understanding.
Recognition Research Group, Media Information Laboratory, NTT Communication Science Laboratories