Given a photo, humans can estimate depth and identify which parts are out of focus on the basis of their experience and knowledge. This is not easy for typical computers, however, because they lack such experience and knowledge. To overcome this limitation, we developed a deep generative model that uses a camera-aperture rendering mechanism, making it possible to learn unknown depth and bokeh effects from only a set of ordinary photos, such as images on the web.
Conventionally, training a 3D estimator that can estimate depth and bokeh effects from a 2D image has required collecting 3D information with special equipment such as depth or stereo cameras. In contrast, training our deep generative model requires only a set of ordinary photos, such as images on the web; neither depth nor bokeh-effect information is needed. To learn 3D representations under this challenging condition, our model incorporates a camera-aperture rendering mechanism, making it possible to learn not only the distribution of the images given as training data but also the distribution of the 3D information hidden in those images. As a result, the model can generate not only an image but also the corresponding depth while controlling bokeh effects.
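To make the idea of aperture rendering concrete, the following is a minimal, simplified sketch of how depth-dependent blur can be rendered from an image and a depth map using the thin-lens circle-of-confusion model. All function and parameter names here are our own illustrative choices, and this simple non-differentiable version is not the model's actual rendering mechanism.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_bokeh(image, depth, aperture, focus_dist, n_levels=8, max_sigma=4.0):
    """Render depth-dependent blur (a simplified, hypothetical sketch).

    image:      (H, W) grayscale image
    depth:      (H, W) per-pixel scene depth
    aperture:   relative aperture size (0 = pinhole, sharp everywhere)
    focus_dist: depth that stays in focus
    """
    # Circle of confusion: grows with aperture size and with the
    # distance of each pixel's depth from the focal plane.
    coc = aperture * np.abs(depth - focus_dist) / np.maximum(depth, 1e-6)
    sigma = np.clip(coc, 0.0, max_sigma)

    # Cheap layered approximation: precompute a small stack of blurred
    # images and pick, per pixel, the level matching its blur amount.
    sigmas = np.linspace(0.0, max_sigma, n_levels)
    stack = np.stack([image if s == 0 else gaussian_filter(image, s)
                      for s in sigmas])
    idx = np.clip(np.rint(sigma / max_sigma * (n_levels - 1)).astype(int),
                  0, n_levels - 1)
    return np.take_along_axis(stack, idx[None], axis=0)[0]
```

Varying `aperture` changes the overall bokeh strength, and varying `focus_dist` changes which depth stays sharp, mirroring the two controls described above. In the actual model, the renderer must additionally be differentiable so that the depth estimate can be learned end-to-end from photos alone.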
Through our experiments, we confirmed that our model can flexibly control bokeh while estimating depth: it can adjust the bokeh strength by changing the aperture size and shift which object is in focus by varying the focus distance.
Since we live in a 3D world, developing computers that can understand that world is important for making them highly compatible with humans. We expect that our model will reduce the cost of collecting the data necessary for developing such computers and open up a new field of 3D understanding.
Recognition Research Group, Media Information Laboratory, NTT Communication Science Laboratories