NeurIPS 2023

NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF

Stefan Lionar, Xiangyu Xu, Min Lin, Gim Hee Lee


Abstract

Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC is the current state-of-the-art method in this field, achieving unprecedented success by combining vision Transformers with large-scale training. However, we identify two key limitations of MCC: 1) the Transformer decoder is inefficient in handling a large number of query points; 2) the 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy of input visual features, allowing each query point to attend only to a small neighborhood. This design not only results in much faster inference but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs, which suffer from holes in their results, our proposed Repulsive UDF achieves more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. In particular, it outperforms MCC by 9.7% in terms of the F1-score on the CO3D-v2 dataset with more than 5× faster running speed.

Introduction

3D reconstruction from a single-view RGB-D input is a fundamental problem in computer vision with applications in robotics [1, 2] and VR/AR [3]. The state-of-the-art approach for this task is MCC [4], which leverages large-scale multi-view images [5] to develop a scalable model for 3D reconstruction from a single RGB-D image.
MCC is supervised with depths and 3D point clouds obtained using the COLMAP framework [6, 7]. By combining Vision Transformers [8, 9] with large-scale training, MCC learns a textured single-view 3D reconstruction model that generalizes to diverse zero-shot settings. However, we have identified two key limitations of MCC that affect its reconstruction quality and model efficiency. First, the Transformer decoder in MCC directly takes in the 3D locations of query points to predict their respective occupancy and color. Due to the quadratic complexity of Transformers, this approach incurs a high computational cost when the number of query points is large, as is often the case for detailed 3D reconstruction. Second, MCC uses the occupancy field as the underlying 3D representation, which hinders the recovery of high-fidelity geometry and texture details. This can be observed in Figure 1, where the reconstruction lacks intricate details.
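
The efficiency argument above can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the MCC or NU-MCC implementation: `full_attention_pairs` counts scores for a generic dense attention in which all tokens and queries attend to one another, `neighborhood_attention_pairs` counts scores when each query attends to only `k` nearby center points, and `attend` is a hypothetical stand-in that replaces a learned attention layer with a softmax over negative squared distances. All function names, the brute-force neighbor search, and the example sizes are assumptions introduced here.

```python
import math
import random


def full_attention_pairs(num_tokens: int, num_queries: int) -> int:
    """Number of attention scores when every element attends to every element.

    Generic dense-attention count (quadratic in the sequence length);
    it is not a measurement of MCC's actual decoder.
    """
    n = num_tokens + num_queries
    return n * n


def neighborhood_attention_pairs(num_queries: int, k: int) -> int:
    """Number of attention scores when each query attends to k centers only."""
    return num_queries * k


def k_nearest(query, centers, k):
    """Indices of the k centers closest to `query` (brute-force search)."""
    order = sorted(range(len(centers)), key=lambda i: math.dist(query, centers[i]))
    return order[:k]


def attend(query, centers, features, k):
    """Softmax-weighted average of the features of the k nearest centers.

    Hypothetical simplification: the score is the negative squared distance
    to each center instead of a learned query-key product.
    """
    idx = k_nearest(query, centers, k)
    scores = [-math.dist(query, centers[i]) ** 2 for i in idx]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * features[i][d] for w, i in zip(weights, idx)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    random.seed(0)
    # Assumed toy sizes: 64 center points with 8-dimensional features.
    centers = [[random.random() for _ in range(3)] for _ in range(64)]
    features = [[random.random() for _ in range(8)] for _ in range(64)]
    print(attend([0.5, 0.5, 0.5], centers, features, k=4))
    print(full_attention_pairs(200, 10_000), neighborhood_attention_pairs(10_000, 4))
```

With these assumed sizes, the dense count grows with the square of the number of query points, whereas the neighborhood count grows linearly in it. This is the intuition behind restricting each query to a small neighborhood of center points.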