ICLR 2026

RAR: Reversing Visual Attention Re-Sinking for Unlocking Potential in Multimodal Large Language Models

Zhehan Kan, Xin Li, Yanlin Liu, Xiaochen Yang, Xinghua Jiang, Yinsong Liu, Deqiang Jiang, Xing Sun, Qingmin Liao, Wenming Yang

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet they frequently exhibit suboptimal output layers: intermediate decoder layers outperform the final ones, signaling underutilized model capacity. In this work, we investigate the root causes and attribute this issue to the Visual Attention Re-sinking phenomenon, which is precipitated by attention gradient sparsity arising from the dominance of textual supervision. This degradation causes attention heads to evolve into sink heads that prioritize low-semantic backgrounds, thereby disrupting modality fusion, neglecting visual information, and biasing outputs toward textual priors, ultimately impairing model performance. To mitigate this, we introduce a parameter-free Sink Attention Dynamic Sparsification (SADS) framework that dynamically preserves all vision heads, ensuring focused attention on semantically salient regions, while retaining only a minimal subset of sink heads, including a designated shared head to safeguard essential global and contextual information. Integrated into diverse MLLMs, our framework yields substantial performance gains across 20 benchmarks spanning five task categories (visual grounding, general VQA, OCR-related VQA, vision-centric tasks, and visual hallucination mitigation), surpassing supervised fine-tuning while boosting inference speed by 10.3%. This approach offers a novel avenue for maximizing MLLM capabilities.
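The head-selection idea described in the abstract (keep every vision head, keep only a few sink heads plus one shared head) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how heads are classified or which sink heads are retained. The function name `select_heads`, the `sink_tokens` input, the `sink_threshold` cutoff, and the `extra_sink_heads` count are all hypothetical choices made only for illustration.

```python
from typing import List, Sequence


def select_heads(
    attn: Sequence[Sequence[float]],  # attn[h][t]: attention head h places on visual token t
    sink_tokens: Sequence[int],       # ASSUMPTION: indices of low-semantic "sink" tokens
    shared_head: int,                 # ASSUMPTION: index of the designated shared head
    sink_threshold: float = 0.5,      # ASSUMPTION: cutoff for labelling a head a sink head
    extra_sink_heads: int = 0,        # ASSUMPTION: additional sink heads to retain
) -> List[bool]:
    """Return keep[h] == True for every head that stays active."""
    sink_set = set(sink_tokens)

    # Fraction of each head's attention mass that lands on sink tokens.
    sink_mass = []
    for row in attn:
        total = sum(row)
        on_sink = sum(w for t, w in enumerate(row) if t in sink_set)
        sink_mass.append(on_sink / total if total > 0 else 0.0)

    # A head counts as a sink head if most of its mass goes to sink tokens.
    is_sink = [m >= sink_threshold for m in sink_mass]

    # Keep all vision (non-sink) heads.
    keep = [not s for s in is_sink]

    # Always keep the designated shared head.
    keep[shared_head] = True

    # Optionally retain a few more sink heads, least sink-dominated first.
    candidates = sorted(
        (h for h, s in enumerate(is_sink) if s and h != shared_head),
        key=lambda h: sink_mass[h],
    )
    for h in candidates[:extra_sink_heads]:
        keep[h] = True
    return keep


# Toy example: 4 heads over 3 visual tokens, token 0 treated as a sink token.
toy_attn = [
    [0.90, 0.05, 0.05],  # sink-dominated
    [0.10, 0.60, 0.30],  # vision head
    [0.80, 0.10, 0.10],  # sink-dominated
    [0.70, 0.20, 0.10],  # sink-dominated
]
keep_mask = select_heads(toy_attn, sink_tokens=[0], shared_head=0)
```

In this toy case the vision head and the shared head stay active while the other two sink heads are dropped. Because the selection is recomputed from the attention pattern of each input, it is dynamic and introduces no learned parameters, which is the property the abstract emphasizes.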