Is AI able to solve problems that humans can't solve naturally?
Click here for project progress
Depth estimation from images is present in nature, and how our eyes are aligned plays an important role in estimating depth. Stereo vision in nature is often referred to as stereopsis. In predators, the eyes are aligned so that there is a large overlap between the left-eye and right-eye images; based on triangulation, predators sense the depth of what they see. In prey animals, the eyes are aligned to give a wider field of view with less overlap between the two images. This is called monocular vision, where the two eyes usually sit on opposite sides of the head, and depth perception is limited. Even a simple experiment of closing one eye shows how hard it becomes to perceive depth.
Over time, a variety of deep neural networks have been able to estimate scene depth using a single image, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), variational auto-encoders (VAEs), generative adversarial networks (GANs), and most recently Vision Transformers (ViTs).
Depth estimation is the task of calculating the depth of each pixel in an image relative to the camera. This task is key in applications such as inferring scene geometry from 2D images, autonomous driving, human pose estimation, and augmented reality (AR). Many early algorithms for depth estimation used geometry-based methods on stereo images with considerable success. Structure from Motion [1] is one such method, in which 3D structure is reconstructed from 2D image sequences. These geometric methods rely on image pairs or sequences to measure depth.
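As a concrete illustration of the triangulation idea behind these stereo methods, here is a minimal sketch using OpenCV's block matcher on a rectified image pair. The file paths and calibration values (focal length f in pixels, baseline B in meters, loosely KITTI-like) are placeholders, not taken from any of the cited papers:

```python
import cv2
import numpy as np

# Placeholder rectified stereo pair (grayscale); the paths are illustrative.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: compare local patches between the two views to find disparity.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Triangulation: depth = focal_length * baseline / disparity.
f, baseline = 721.5, 0.54  # assumed calibration, roughly KITTI-like
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * baseline / disparity[valid]
```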
Monocular depth estimation has gained more attention due to the limitations of depth estimation from stereo images, such as occlusions, and the growing use of monocular cameras thanks to their low cost. With the rapid growth of deep learning, many deep neural networks [2] have been effectively applied to monocular depth estimation. Eigen et al. [3] first used CNNs for monocular depth estimation. Their network had two stacks: a coarse-scale network that predicts the depth of the scene at a global level, which is then refined within local regions by a fine-scale network. More recently, Guizilini et al. [4] proposed PackNet, a model that introduces 3D packing and unpacking blocks to preserve and recover spatial information important for depth estimation.
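As a rough illustration of the coarse-to-fine idea, here is a heavily simplified PyTorch sketch in the spirit of Eigen et al.'s two-stack design, not their exact architecture: a coarse network predicts a global depth map, which a fine network then refines using the input image.

```python
import torch
import torch.nn as nn

class CoarseNet(nn.Module):
    """Predicts a low-resolution, global depth estimate (simplified)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, x):
        coarse = self.features(x)
        # Upsample back to input resolution so the fine network can refine it.
        return nn.functional.interpolate(
            coarse, size=x.shape[-2:], mode="bilinear", align_corners=False
        )

class FineNet(nn.Module):
    """Refines the coarse prediction with local detail from the image."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),  # image + coarse depth
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x, coarse):
        return self.refine(torch.cat([x, coarse], dim=1))

x = torch.randn(1, 3, 128, 416)   # dummy input image
coarse = CoarseNet()(x)
depth = FineNet()(x, coarse)      # refined per-pixel depth prediction
```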
Zhao et al. [5] propose a framework that integrates convolutions and Vision Transformer blocks, arguing that the performance of CNN-based frameworks is restricted by the limited receptive field of CNNs. Their Monocular Vision Transformer (MonoViT) framework has a DepthNet and a PoseNet for depth prediction and pose estimation, respectively, and is trained through image reconstruction losses.
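The image reconstruction objective in this self-supervised setting is typically a weighted combination of SSIM and L1 between the target view and a reconstruction warped from a nearby view. Below is a minimal sketch of that photometric loss following the common Monodepth-style formulation, not code from [5]; it assumes the warped reconstruction is already available:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM dissimilarity over 3x3 windows (Monodepth-style)."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(target, reconstructed, alpha=0.85):
    """Weighted SSIM + L1: the standard self-supervised photometric objective."""
    l1 = (target - reconstructed).abs()
    return alpha * ssim(target, reconstructed) + (1 - alpha) * l1
```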
By the end of the project, we want a sense of how deep neural networks can estimate scene depth from a single image. We propose to reproduce the work from the paper that uses a vision transformer [5] and to build a CNN-based solution as well. Self-attention leads to a different means of perception within these models. A CNN starts off very local and slowly builds a global perspective: it recognizes an image pixel by pixel, identifying features such as corners and lines, working its way up from the local to the global. In transformers, by contrast, self-attention lets even the very first layer of processing make connections between distant image locations.
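To make this difference concrete, here is a minimal single-head self-attention computation in PyTorch (the patch count and embedding size are illustrative, not taken from [5]): every patch attends to every other patch, so the attention map links distant locations directly in the first layer.

```python
import torch

# 16 illustrative patch embeddings of dimension 64 (one image, flattened patches).
patches = torch.randn(1, 16, 64)

# Single-head self-attention: queries, keys, values come from the same patches.
Wq, Wk, Wv = (torch.nn.Linear(64, 64) for _ in range(3))
q, k, v = Wq(patches), Wk(patches), Wv(patches)

# Every patch scores against every other patch: a 16x16 attention map,
# so even the first layer connects distant image locations directly.
attn = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1)
out = attn @ v
print(attn.shape)  # torch.Size([1, 16, 16])
```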
We will visualize how features are generated at different levels of the two models by tapping into intermediate layers. We will also run multiple experiments to understand the effects of changing the model size, learning rate, and other hyperparameters.
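One way we could tap into intermediate layers in PyTorch is with forward hooks. The toy model and layer choice below are placeholders for whichever networks we actually train:

```python
import torch

features = {}

def save_features(name):
    def hook(module, inputs, output):
        features[name] = output.detach()  # stash the layer's output for later plotting
    return hook

# `model` stands in for our depth network; the hooked layer is illustrative.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1),
)
model[0].register_forward_hook(save_features("conv1"))

model(torch.randn(1, 3, 64, 64))
print(features["conv1"].shape)  # feature maps ready for visualization
```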
We will train and run our experiments on the KITTI and NYU-Depth V2 datasets. At the end of the experiments, we will build a pipeline to test real-time depth estimation on the live stream from a web camera.
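As a rough sketch of that pipeline with OpenCV, where `predict_depth` is a stand-in for the trained model's inference call:

```python
import cv2
import numpy as np

def predict_depth(frame):
    """Placeholder for the trained model's inference; returns dummy depth in [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return gray.astype(np.float32) / 255.0

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    depth = predict_depth(frame)
    # Color-map the depth values so they are easy to inspect on screen.
    vis = cv2.applyColorMap((depth * 255).astype(np.uint8), cv2.COLORMAP_MAGMA)
    cv2.imshow("depth", vis)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```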
[1] U. R. Dhond and J. K. Aggarwal, "Structure from stereo - a review," IEEE Transactions on Systems, Man, and Cybernetics, 1989
[2] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, "Monocular depth estimation based on deep learning: an overview," Science China Technological Sciences, 2020
[3] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," NIPS, 2014
[4] V. Guizilini, R. Ambrus, S. Pillai, and A. Gaidon, "3D packing for self-supervised monocular depth estimation," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
[5] C. Zhao, Y. Zhang, M. Poggi, F. Tosi, X. Guo, Z. Zhu, G. Huang, Y. Tang, and S. Mattoccia, "MonoViT: self-supervised monocular depth estimation with a vision transformer," arXiv, abs/2208.03543, 2022