CVPR 2025

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

Abstract

Evidence: The <obj_start> man <obj_end> <box_start> [[575, 513, 544, 972]] ... with <obj_start> pink pills <obj_end> <box_start> [[355, 443, 33, 61]] <box_end> later in the images.
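
The evidence line above interleaves object labels and boxes using the tags `<obj_start>`, `<obj_end>`, `<box_start>`, and `<box_end>`. As a minimal sketch of how such a string could be read, the Python snippet below extracts (label, box) pairs and recovers the plain text. The function names `parse_mentions` and `strip_tags` are hypothetical and not part of the dataset's tooling. The meaning of the four box numbers is not stated in the line, so they are kept as raw integers. The closing `<box_end>` is treated as optional because the sample elides it after the first box.

```python
import re

# One grounded mention: a label wrapped in <obj_start>/<obj_end>, followed by
# <box_start> and a [[a, b, c, d]] list. <box_end> is deliberately not required.
MENTION = re.compile(
    r"<obj_start>\s*(?P<label>.*?)\s*<obj_end>\s*"
    r"<box_start>\s*\[\[(?P<box>[^\]]*)\]\]"
)


def parse_mentions(evidence):
    """Return (label, [int, int, int, int]) pairs found in an evidence string."""
    mentions = []
    for match in MENTION.finditer(evidence):
        # Coordinate semantics are an open assumption; keep the raw integers.
        box = [int(value) for value in match.group("box").split(",")]
        mentions.append((match.group("label"), box))
    return mentions


def strip_tags(evidence):
    """Drop grounding tags and boxes, leaving only the readable sentence."""
    text = re.sub(r"<box_start>\s*\[\[[^\]]*\]\]\s*(?:<box_end>)?", "", evidence)
    text = re.sub(r"<obj_(?:start|end)>", "", text)
    return re.sub(r"\s+", " ", text).strip()


# The sample is copied from the evidence line, including its "..." elision.
sample = (
    "The <obj_start> man <obj_end> <box_start> [[575, 513, 544, 972]] ... "
    "with <obj_start> pink pills <obj_end> <box_start> [[355, 443, 33, 61]] "
    "<box_end> later in the images."
)

for label, box in parse_mentions(sample):
    print(label, box)
print(strip_tags(sample))
```

Keeping the label and the box together as one pair makes it straightforward to check that every grounded object in an evidence sentence has a matching region, whatever coordinate convention the boxes turn out to use.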