NeurIPS 2021
Object DGCNN: 3D Object Detection using Dynamic Graphs
Yue Wang, Justin M. Solomon
Cited by 123
Abstract
3D object detection often involves complicated training and testing pipelines, which require substantial domain knowledge about individual datasets. Inspired by recent non-maximum suppression-free 2D object detection models, we propose a 3D object detection architecture on point clouds. Our method models 3D object detection as message passing on a dynamic graph, generalizing the DGCNN framework to predict a set of objects. In our construction, we remove the necessity of post-processing via object confidence aggregation or non-maximum suppression. To facilitate object detection from sparse point clouds, we also propose a set-to-set distillation approach customized to 3D detection. This approach aligns the outputs of the teacher model and the student model in a permutation-invariant fashion, significantly simplifying knowledge distillation for the 3D detection task. Our method achieves state-of-the-art performance on autonomous driving benchmarks. We also provide extensive analysis of the detection model and distillation framework.

Methods for 3D object detection have progressed rapidly, yielding deployable autonomous driving perception systems. Following common practice in 2D vision, 3D object detection often employs complex training and testing pipelines, including many post-processing operations, to achieve superior performance. These operations are typically non-parallelizable and inefficient even with modern deep learning frameworks, implying a steep trade-off between efficiency and effectiveness.

Modern methods usually employ two stages [1, 2], including a region proposal network [3] that can introduce significant training overhead. Subsequent efforts simplify this pipeline for 3D object detection. PointPillars [4] introduces a one-stage anchor-based design, simplifying training. PillarOD [5] and CenterPoint [6] improve the one-stage model by making per-pillar predictions, that is, one prediction per point on the ground plane.
They assign ground-truth bounding boxes to multiple outputs during training to ease optimization. However, they predict redundant boxes, which can overlap in the same positions; the extra boxes are eliminated a posteriori using non-maximum suppression (NMS). Removing hand-designed components like NMS from training and testing remains elusive.

We introduce Object DGCNN, a streamlined architecture for 3D object detection from point clouds. Like DETR for 2D object detection [7], we predict a set of bounding boxes from the raw data, enabling an NMS-free pipeline that achieves real-time performance. A critical new component is to treat each object query as a point in a set whose embedding is learned using DGCNN [8]. Compared to the self-attention module [9] in DETR, DGCNN leverages a sparse set of object relations, which reflects the real object distribution in the scene. In contrast to PointPillars [4], PillarOD [5], and CenterPoint [6], our method does not require post-processing.

We also provide a knowledge distillation approach customized to 3D object detection. Existing methods typically distill dense feature maps from a teacher model to a student model, whose training objective does not necessarily capture 3D object detection performance [10]. In contrast, we propose set-to-set distillation training that aligns the outputs of the teacher and the student in a permutation-invariant fashion. This process is enabled by the unified Object DGCNN architecture. In addition to

35th Conference on Neural Information Processing Systems (NeurIPS 2021),
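To make the idea of permutation-invariant alignment concrete, the following is a minimal sketch of matching a teacher's predicted box set to a student's before computing a loss. It is an illustration under stated assumptions, not the paper's implementation: the box encoding (center x, center y, size), the L1 matching cost, and the function names `l1_cost`, `match_sets`, and `set_to_set_loss` are all hypothetical, and the brute-force search over permutations is only practical for very small sets. For realistic set sizes, a Hungarian-style solver such as SciPy's `linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def l1_cost(box_a, box_b):
    """L1 distance between two boxes given as equal-length tuples (assumed encoding)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_sets(teacher, student):
    """Find the one-to-one assignment with minimum total cost.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    student box matched to teacher box i. Brute force: O(n!) in the set size.
    """
    if len(teacher) != len(student):
        raise ValueError("this sketch assumes equally sized sets")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(student))):
        cost = sum(l1_cost(t, student[j]) for t, j in zip(teacher, perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


def set_to_set_loss(teacher, student):
    """Mean matched cost; unchanged under any reordering of either set."""
    _, total = match_sets(teacher, student)
    return total / max(len(teacher), 1)


# Hypothetical boxes: (center_x, center_y, size)
teacher_boxes = [(0.0, 0.0, 1.0), (5.0, 5.0, 2.0)]
student_boxes = [(5.5, 5.0, 2.0), (0.0, 0.5, 1.0)]

assignment, total_cost = match_sets(teacher_boxes, student_boxes)
loss = set_to_set_loss(teacher_boxes, student_boxes)
loss_shuffled = set_to_set_loss(teacher_boxes, student_boxes[::-1])
```

Because the loss is computed only after the optimal matching is found, reordering the student's outputs leaves it unchanged, which is the property that lets two set-prediction models be compared without aligning dense feature maps.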