CVPR 2025

Generating Multimodal Driving Scenes via Next-Scene Prediction

Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang

Abstract

Figure 1. An overview of UMGen, our proposed driving scene generation paradigm. (a) Starting from a random initialization, UMGen generates ego-centric, multimodal scenes frame by frame. Each scene encompasses four modalities: ego-vehicle action, map, traffic agent, and image. (b) UMGen offers multiple functions. It can autonomously generate multimodal scene sequences based solely on its own historical context, and it can also predict the other modalities from ego-vehicle actions provided by users. Furthermore, UMGen can incorporate user-specified agent actions to create customized scene sequences. The three scene sequences, arranged from top to bottom, show the ego vehicle autonomously driving straight through an intersection, executing a user-defined right turn, and encountering a user-specified white car that cuts in front of it. For better visualization, a portion of the map in the last row is zoomed in.
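The frame-by-frame generation described in the caption can be illustrated with a minimal sketch. This is not the UMGen implementation: `toy_predictor`, `rollout`, `Scene`, and the placeholder string tokens are hypothetical names introduced here, the within-frame modality order simply follows the order listed in the caption (an assumption), and the learned model is replaced by a stub. The sketch only shows the control flow: each frame is predicted from the history of earlier frames, and a user-supplied ego action, when present, replaces the predicted one so that the remaining modalities are conditioned on it.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed within-frame order, taken from the order the caption lists the modalities.
MODALITIES = ("ego_action", "map", "agents", "image")


@dataclass
class Scene:
    """One ego-centric frame holding the four modalities (placeholder strings here)."""
    ego_action: str
    map: str
    agents: str
    image: str


def toy_predictor(history: List[Scene], partial: Dict[str, str],
                  modality: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the learned model; returns a placeholder token."""
    if modality == "ego_action":
        return rng.choice(["straight", "left", "right"])
    # Later modalities depend on the frame index and the ego action chosen so far.
    return f"{modality}@t{len(history)}|ego={partial['ego_action']}"


def rollout(num_frames: int,
            ego_actions: Optional[Dict[int, str]] = None,
            seed: int = 0) -> List[Scene]:
    """Generate scenes frame by frame, optionally overriding ego actions per frame."""
    rng = random.Random(seed)      # stands in for the random initialization
    history: List[Scene] = []
    for t in range(num_frames):
        partial: Dict[str, str] = {}
        for modality in MODALITIES:
            if modality == "ego_action" and ego_actions and t in ego_actions:
                # User-specified control: take the given action instead of predicting it.
                partial[modality] = ego_actions[t]
            else:
                partial[modality] = toy_predictor(history, partial, modality, rng)
        history.append(Scene(**partial))
    return history


if __name__ == "__main__":
    for scene in rollout(3, ego_actions={1: "right"}):
        print(scene)
```

Calling `rollout(3)` corresponds to fully autonomous generation, while `rollout(3, ego_actions={1: "right"})` mimics the user-defined right turn in the second row of the figure. User-specified agent actions could be injected in the same way, by overriding the `agents` entry rather than `ego_action`.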