CVPR 2025

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Katrin Renz, Long Chen, Elahe Arani, Oleg Sinavski

Abstract

[Figure 1 image omitted; panel labels: Instruction, VQA, Driving, Commentary, Driving mode]

Figure 1. Overview: SimLingo is a vision-language-action model unifying the tasks of autonomous driving, vision-language understanding, and language-action alignment. It is state of the art on the official CARLA Leaderboard 2.0 and Bench2Drive using only camera images. We introduce the task of Action Dreaming, a form of instruction following, to improve the alignment of language and action.