AAAI 2025

Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Duc-Hai Pham, Duc Dung Nguyen, Anh Pham, Tuan Ho, Phong Nguyen, Khoi Nguyen, Rang Nguyen

Cited by 6

Abstract

Accurate prediction of 3D semantic occupancy from 2D visual images is crucial for enabling autonomous agents to understand their surroundings for planning and navigation. State-of-the-art methods typically rely on fully supervised approaches, requiring large labeled datasets obtained through expensive LiDAR sensors and meticulous voxel-wise annotation by human experts. The resource-intensive nature of this annotation process significantly limits the scalability and application of these methods. To address this challenge, we propose a novel semi-supervised framework that reduces reliance on densely annotated data. Our approach leverages 2D foundation models to extract essential 3D scene geometry and semantic cues, enabling a more efficient training process. The proposed framework has two key advantages: (1) Generalizability, as it is compatible with various 3D semantic scene completion methods, including 2D-3D lifting and 3D-2D transformer techniques; and (2) Effectiveness, as demonstrated by experiments on the SemanticKITTI and NYUv2 datasets, where our method achieves up to 85% of the fully supervised performance using only 10% of the labeled data. This approach not only reduces the cost of data annotation but also shows potential for broader adoption in vision-based systems for 3D semantic occupancy prediction.