CVPR 2023

BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection

Lei Yang, Kaicheng Yu, Tao Tang, Jun Li, Kun Yuan, Li Wang, Xinyu Zhang, Peng Chen

Abstract

Figure 1. (a) To produce 3D bounding boxes from a monocular image, state-of-the-art methods first predict per-pixel depth, either explicitly or implicitly, to determine the 3D location of foreground objects against the background. However, when we plot the per-pixel depth on the image, we notice that the difference between points on the car roof and the surrounding ground shrinks quickly as the car moves away from the camera, which makes depth hard to optimize, especially for distant objects. (b) In contrast, when we plot the per-pixel height above the ground, we observe that this difference stays the same regardless of distance, which makes objects visually easier for the network to detect. However, one cannot directly regress the 3D location by predicting height alone. (c) To this end, we propose a novel framework, BEVHeight, to address this issue. Empirical results show that our method surpasses the best competing method by a margin of 4.85% on clean settings and by over 26.88% on noisy settings.
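
To illustrate why a height value alone does not give a 3D location, while height combined with the camera geometry does, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera mounted at a known height with a known downward pitch and no roll or yaw; the function name `pixel_height_to_world` and all parameter names are hypothetical. The idea is to intersect the viewing ray of a pixel with the horizontal plane at the predicted height.

```python
import math


def pixel_height_to_world(u, v, height, cam_height, pitch, f, cx, cy):
    """Intersect the viewing ray of pixel (u, v) with the plane z = height.

    Assumed (hypothetical) setup:
      - world frame: x right, y forward, z up, ground plane at z = 0
      - camera at (0, 0, cam_height), pitched down by `pitch` radians
      - pinhole intrinsics: focal length f and principal point (cx, cy), in pixels
    Returns the world point (x, y, z) with z == height.
    """
    # Ray direction in the camera frame (x right, y down, z forward).
    dx = (u - cx) / f
    dy = (v - cy) / f

    # Rotate the ray into the world frame (pitch about the camera x axis).
    ray_x = dx
    ray_y = math.cos(pitch) - dy * math.sin(pitch)
    ray_z = -math.sin(pitch) - dy * math.cos(pitch)

    # The ray must point downward and the plane must lie below the camera.
    if ray_z >= 0 or height >= cam_height:
        raise ValueError("ray does not reach the plane z = height")

    # Scale the ray so that it ends exactly on the plane z = height.
    t = (height - cam_height) / ray_z
    return (t * ray_x, t * ray_y, height)


if __name__ == "__main__":
    cam = dict(cam_height=6.0, pitch=0.0, f=1000.0, cx=960.0, cy=540.0)
    # Same pixel, two different assumed heights.
    print(pixel_height_to_world(960.0, 1540.0, 0.0, **cam))
    print(pixel_height_to_world(960.0, 1540.0, 1.5, **cam))
```

In this toy configuration (a level camera 6 m above the ground, a pixel looking 45 degrees downward), the same pixel lands 6 m ahead of the camera if its height is 0 m and 4.5 m ahead if its height is 1.5 m. The predicted height therefore only determines a location once it is paired with the pixel ray and the camera parameters.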