NeurIPS2023

HotBEV: Hardware-oriented Transformer-based Multi-View 3D Detector for BEV Perception

Peiyan Dong, Zhenglun Kong, Xin Meng, Pinrui Yu, Yifan Gong, Geng Yuan, Hao Tang, Yanzhi Wang


Abstract

Bird's-eye-view (BEV) perception plays a critical role in autonomous driving systems, involving the accurate and efficient detection and tracking of objects from a top-down perspective. To achieve real-time decision-making in self-driving scenarios, low-latency computation is essential. While recent approaches to BEV detection have focused on improving detection precision using Lift-Splat-Shoot (LSS)-based or transformer-based schemas, the substantial computational and memory burden of these approaches increases the risk of system crashes when multiple on-vehicle tasks run simultaneously. Unfortunately, there is a dearth of literature on efficient BEV detector paradigms, let alone on achieving realistic speedups. Unlike existing works that focus on reducing computation costs, this paper focuses on developing an efficient model design that prioritizes actual on-device latency. To achieve this goal, we propose a latency-aware design methodology that considers key hardware properties, such as memory access cost and degree of parallelism. Given the prevalence of GPUs as the main computation platform for autonomous driving systems, we develop a theoretical latency prediction model and introduce efficient building operators. By leveraging these operators and following an effective local-to-global visual modeling process, we propose a hardware-oriented backbone that is also optimized for strong feature capturing and fusing. Using these insights, we present a new hardware-oriented framework for efficient yet accurate camera-view BEV detectors. Experiments show that HotBEV achieves a 2%~23% NDS gain and a 2%~7.8% mAP gain with a 1.1×~3.4× speedup compared to existing works on V100; on other GPU devices, such as the GTX 2080 and the low-end GTX 1080, HotBEV is 1.1×~6.3× faster than others. The code is available at HotBEV.

* Equal Contribution † Corresponding Author

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
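To illustrate why computation cost alone can be a poor proxy for on-device latency, the following is a minimal sketch of a roofline-style estimate. It is not the latency prediction model proposed in this paper: it is a standard simplification that takes the larger of compute time and memory-access time, and it ignores scheduling and degree of parallelism. The names `Device`, `Op`, and `estimate_latency`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device description (values are illustrative, not measured)."""
    peak_flops: float     # peak arithmetic throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


@dataclass
class Op:
    """Hypothetical operator description."""
    name: str
    flops: float        # arithmetic work, FLOPs
    bytes_moved: float  # memory traffic, bytes


def estimate_latency(op: Op, dev: Device) -> float:
    """Roofline-style lower bound in seconds: the op is limited by
    whichever of compute or memory access takes longer."""
    compute_time = op.flops / dev.peak_flops
    memory_time = op.bytes_moved / dev.mem_bandwidth
    return max(compute_time, memory_time)


# Illustrative device: 10 TFLOP/s and 500 GB/s (assumed numbers).
gpu = Device(peak_flops=10e12, mem_bandwidth=500e9)

# Two operators with identical FLOPs but very different memory traffic.
compute_bound = Op("compute_bound", flops=1e9, bytes_moved=1e6)
memory_bound = Op("memory_bound", flops=1e9, bytes_moved=1e9)

t_compute = estimate_latency(compute_bound, gpu)
t_memory = estimate_latency(memory_bound, gpu)

print(f"{compute_bound.name}: {t_compute * 1e3:.3f} ms")
print(f"{memory_bound.name}: {t_memory * 1e3:.3f} ms")
```

Under these assumed numbers, the two operators have the same FLOP count, yet the second is dominated by memory access and is estimated to be much slower. This is the kind of gap that a FLOPs-only comparison would miss.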
[Figure 1, panels (a) and (b). Recoverable labels: (i) CNN-based, (ii) Transformer-based, (iii) HotBEV; HotBEV design across GPUs: Volta V100, Turing RTX2080Ti, Pascal RTX1080Ti, each marked "Energy Concerned".]

In this paper, we propose a hardware-oriented transformer-based framework (HotBEV) for camera-only 3D detection tasks, which achieves both higher detection precision and remarkable speedup across both high-end and low-end GPUs (see Figure 1). First, we propose a theoretical latency prediction model that accounts for the algorithm, the scheduling strategy, and the hardware properties. Given a target GPU, we directly use latency, rather than computation FLOPs, to guide our algorithm design. We then perform a latency breakdown of the major modules in classic camera-only detectors and find that the backbone is usually the speed bottleneck. After benchmarking