ICCV 2023

ObjectFusion: Multi-modal 3D Object Detection with Object-Centric Fusion

Qi Cai, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei

Cited by 71

Abstract

Recent progress on multi-modal 3D object detection has featured BEV (Bird's-Eye-View) based fusion, which unifies LiDAR point clouds and camera images in a shared BEV space. Nevertheless, the camera-to-BEV transformation is not trivial: the depth of each pixel is inherently ambiguous to estimate, which results in spatial misalignment between the features of the two modalities. Moreover, the transformation inevitably introduces projection distortion of camera image features in BEV space. In this paper, we propose a novel Object-centric Fusion (ObjectFusion) paradigm, which dispenses with the camera-to-BEV transformation entirely during fusion and instead aligns object-centric features across modalities for 3D object detection. ObjectFusion first learns three kinds of modality-specific feature maps (i.e., voxel, BEV, and image features) from LiDAR point clouds, their BEV projections, and camera images. A set of 3D object proposals is then produced from the BEV features via a heatmap-based proposal generator. Next, the 3D object proposals are reprojected to the voxel, BEV, and image spaces, and voxel and RoI pooling are used to generate spatially aligned object-centric features for each modality. The object-centric features of the three modalities are then fused at the object level, and the fused features are fed into the detection heads. Extensive experiments on the nuScenes dataset demonstrate the superiority of ObjectFusion, which achieves 69.8% mAP on the nuScenes validation set and improves on BEVFusion by 1.3%.
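The reprojection step described in the abstract can be illustrated with a minimal sketch of how one 3D proposal might be mapped into image space to obtain a 2D region for RoI pooling. This is not the authors' implementation: the function names `box_corners` and `project_box_to_image`, the box parameterization (center, size, yaw), and the pinhole camera model with a 4x4 LiDAR-to-camera matrix and a 3x3 intrinsic matrix are all assumptions made for illustration.

```python
import numpy as np


def box_corners(center, size, yaw):
    """Return the 8 corners, shape (8, 3), of a 3D box.

    Assumed parameterization (hypothetical, not from the paper):
    center = (x, y, z), size = (length, width, height), yaw about the z axis.
    """
    # All sign combinations give the 8 corner offsets in the box's local frame.
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    local = signs * np.asarray(size, dtype=float) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.asarray(center, dtype=float)


def project_box_to_image(corners, lidar_to_cam, intrinsics, image_hw):
    """Project 3D corners to pixels and return an axis-aligned RoI.

    Returns (x1, y1, x2, y2) clipped to the image, or None when the box
    is entirely behind the camera or the clipped RoI is empty.
    """
    homo = np.hstack([corners, np.ones((corners.shape[0], 1))])
    cam = (lidar_to_cam @ homo.T).T[:, :3]
    # Simplification: drop corners behind the camera instead of clipping
    # the box against the near plane.
    cam = cam[cam[:, 2] > 1e-3]
    if cam.shape[0] == 0:
        return None
    pix = (intrinsics @ cam.T).T
    uv = pix[:, :2] / pix[:, 2:3]  # perspective divide
    height, width = image_hw
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    x1, x2 = np.clip([x1, x2], 0, width - 1)
    y1, y2 = np.clip([y1, y2], 0, height - 1)
    if x2 <= x1 or y2 <= y1:
        return None
    return float(x1), float(y1), float(x2), float(y2)


if __name__ == "__main__":
    # Toy setup: LiDAR frame coincides with the camera frame (z forward).
    K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
    corners = box_corners(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0), yaw=0.0)
    print(project_box_to_image(corners, np.eye(4), K, image_hw=(100, 100)))
```

The resulting rectangle is the kind of region over which image features could be pooled for that proposal. In a real pipeline the calibration matrices would come from the dataset, and the pooling itself would be done by an RoI pooling or RoI align operator on the image feature map.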