ICLR 2026

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Cited by 13

Abstract

The sequential structure of videos challenges the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frames mentioned in the question (hereafter referred to as the "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Distractors are annotated with carefully designed strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded for multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.

Project: https://mmr-v.github.io/

Introduction

Recent models like OpenAI's o1 [1] and DeepSeek-R1 [2] have significantly improved text reasoning ability through reinforcement learning. This has sparked growing interest in multimodal reasoning [3].
Models like o3 and o4-mini [4] have achieved impressive results on image reasoning tasks through tool use, integrating visual information into the reasoning process to enable deep reflection and evidence mining. However, most of these studies focus on images, with limited exploration of more challenging video reasoning tasks. Video naturally involves sequential and richer multimodal information, requiring models to reason and mine evidence across multiple frames over long ranges. Since this capability is essential for real-world applications such as embodied intelligence and intelligent security monitoring [5; 6], it naturally raises an important question: can current MLLMs perform deep multimodal reasoning and mine evidence on complex videos as o3 does on image tasks?

Preprint. Under review.

Inspired by cognitive and psychological theories [9; 10; 11], such as Kahneman's Dual Process Theory [12], we categorize the tasks in MMR-V into implicit reasoning and explicit reasoning. The key distinction lies in whether the question requires reasoning beyond surface-level information to infer underlying implications. Explicit reasoning covers questions that can be solved using perceivable information from the video. For example, the task shown in Figure 1 requires noticing the two lighters hidden in the hand. Implicit reasoning requires extracting and interpreting the subtext behind visual information. For example, the implicit reasoning case shown in Figure 1 requires inferring the underlying implication that the girl's room number 7 symbolizes good luck. This is closer to an assessment of emotional intelligence (EQ), testing whether the model can draw on its deep understanding of world knowledge to follow implicit, subconscious reasoning paths as humans do.

MMR-V comprises 317 videos and 1,257 tasks. The videos span six major categories, with lengths ranging from 7 to 3,771 seconds and averaging 277 seconds. Tasks are further divided into 10 categories and subcategories.
Each task is in multiple-choice format with approximately ten options on average. Tasks typically require reasoning over an average of 12 video frames, covering about 60% of the video duration. All questions and correct answers are human-annotated and reviewed. Distractors are generated using a carefully designed annotation strategy (details in Section 3.2).

We evaluated 9 proprietary models and 11 open-source models on MMR-V. The results reveal that even the best-performing model, o4-mini, achieved only 52.5% accuracy, highlighting the significant challenge MMR-V poses to current multimodal large language models. Our key findings are as follows. (1) Multimodal reasoning challenge: Our findings in Section 4.2 show that reasoning enhancement strategies (e.g., CoT and scaling test-time compute) yield limited improvements, indicating that MMR-V presents a greater challenge to current multimodal reasoning models. Further error analysis in Section 4.5 shows that the CoT demanded in multimodal reasoning differs from that in textual reasoning. Current models tend to rely on textual reasoning based on visual