CVPR2024

MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models

Yasiru Ranasinghe, Deepti Hegde, Vishal M. Patel

21 citations

Abstract

3D object detection and pose estimation from a single-view image is challenging due to the high uncertainty caused by the absence of 3D perception. As a solution, recent monocular 3D detection methods leverage additional modalities, such as stereo image pairs and LiDAR point clouds, to enhance image features at the expense of additional annotation costs. We propose using diffusion models to learn effective representations for monoc-ular 3D detection without additional modalities or training data. We present MonoDiff, a novel framework that em-ploys the reverse diffusion process to estimate 3D bounding box and orientation. But, considering the variability in bounding box sizes along different dimensions, it is inef-fective to sample noise from a standard Gaussian distribution. Hence, we adopt a Gaussian mixture model to sam-ple noise during the forward diffusion process and initialize the reverse diffusion process. Furthermore, since the diffusion model generates the 3D parameters for a given object image, we leverage 2D detection information to pro-vide additional supervision by maintaining the correspon-dence between 3D/2D projection. Finally, depending on the signal-to-noise ratio, we incorporate a dynamic weighting scheme to account for the level of uncertainty in the supervision by projection at different timesteps. MonoDiff outperforms current state-of-the-art monocular 3D detection methods on the KITTI and Waymo benchmarks without additional depth priors. MonoDiff project is available at: https://dylran.github.iolmonodiffgithub.io.