ICML 2025

TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois Knoll

Abstract

Figure 1: TUMTraf VideoQA introduces a comprehensive benchmark for video-level traffic scene understanding. Our baseline model, TraffiX-Qwen, solves multiple tasks within a single unified model, including video QA, spatio-temporal grounding, and referred object captioning. In our approach, the spatio-temporal location of an object is represented as a tuple (c, f_n, x, y), where c is a unique object identifier, f_n is the normalized frame timestamp, and (x, y) is the center of the object in the image, normalized with respect to the image dimensions.
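The tuple representation described in the caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `SpatioTemporalPoint` and `encode_location` are hypothetical, and the frame normalization (frame index divided by the last frame index, so that values fall in [0, 1]) is an assumption, since the caption only states that the timestamp is normalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatioTemporalPoint:
    """Hypothetical container for one (c, f_n, x, y) tuple."""
    object_id: str     # c: unique object identifier
    frame_norm: float  # f_n: normalized frame timestamp
    x: float           # object center, normalized by image width
    y: float           # object center, normalized by image height

    def as_tuple(self):
        return (self.object_id, self.frame_norm, self.x, self.y)


def encode_location(object_id, frame_index, num_frames,
                    cx_px, cy_px, img_width, img_height):
    """Build a normalized (c, f_n, x, y) tuple from pixel-space inputs.

    Assumption: f_n = frame_index / (num_frames - 1), which maps the
    first frame to 0.0 and the last frame to 1.0. The source does not
    specify the exact normalization.
    """
    if img_width <= 0 or img_height <= 0:
        raise ValueError("image dimensions must be positive")
    if not 0 <= frame_index < num_frames:
        raise ValueError("frame_index out of range")

    # A single-frame clip has no temporal extent; use 0.0 by convention.
    frame_norm = frame_index / (num_frames - 1) if num_frames > 1 else 0.0
    return SpatioTemporalPoint(
        object_id=object_id,
        frame_norm=frame_norm,
        x=cx_px / img_width,
        y=cy_px / img_height,
    )


# Usage: an object centered in a 1920x1080 image, halfway through an 11-frame clip.
point = encode_location("c1", 5, 11, 960.0, 540.0, 1920, 1080)
print(point.as_tuple())
```

Normalizing both time and position keeps the representation independent of video length and image resolution, which is presumably why the tuple is defined in normalized coordinates.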