ICLR 2025
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
Abstract
LVLMs consistently give attention to irrelevant visual tokens.
Q:
① Why does this phenomenon occur?
② How does this phenomenon affect the performance?

B. Preliminaries
"Attention sink" has been observed in LLMs and ViTs.
(Figure panels: LLMs, ViTs)
Attention sink occurs…
① by massive activation in specific dims (e.g., 1415/2533 @ LLaMA-2-7B)
② at less meaningful tokens (e.g., <BOS>, ".", ":" in LLMs / background in ViTs)

C. Analysis
① Irrelevant visual tokens exhibit massive activation. Let's call these tokens "visual sink tokens".
② Visual sink tokens are less meaningful.
⒜ Masking visual sink tokens has little impact on performance.
⒝ Visual sink tokens have extremely low attention contributions.
⒞ Visual sink tokens are mostly located in the background.
(Figure panels: ⒜, ⒝, ⒞)

Can we recycle surplus attention in visual attention sink?
Attention weights in sink tokens = FREE "attention budget"
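
The two ideas above (spotting sink tokens by massive activation in specific dims, then treating their attention as a reusable budget) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the sink dims 1415/2533 come from the LLaMA-2-7B example above, while the RMS-normalized score, the threshold `tau`, the `fraction` parameter, and the proportional redistribution rule are all hypothetical choices made for this sketch.

```python
import math

# Sink dimensions from the poster's LLaMA-2-7B example; other models differ.
SINK_DIMS = (1415, 2533)


def sink_score(hidden, sink_dims=SINK_DIMS):
    """Largest |activation| over the sink dims, divided by the token's RMS.

    The RMS normalization is an assumption of this sketch.
    """
    rms = math.sqrt(sum(v * v for v in hidden) / len(hidden))
    if rms == 0.0:
        return 0.0
    return max(abs(hidden[d]) for d in sink_dims) / rms


def find_sink_tokens(hidden_states, tau=20.0, sink_dims=SINK_DIMS):
    """Return indices of tokens whose sink score is at least tau.

    tau is a hypothetical threshold chosen for illustration.
    """
    return [i for i, h in enumerate(hidden_states)
            if sink_score(h, sink_dims) >= tau]


def recycle_attention(attn, sink_idx, target_idx, fraction=0.5):
    """Move `fraction` of the attention on sink tokens to target tokens.

    attn is one row of attention weights (one query, one head).
    The freed budget is split among target tokens in proportion to their
    current weights, so the row sum is unchanged. sink_idx and target_idx
    are assumed to be disjoint.
    """
    target_mass = sum(attn[j] for j in target_idx)
    if target_mass == 0.0:
        return list(attn)  # nothing to redistribute to; leave the row as-is

    out = list(attn)
    budget = 0.0
    for i in sink_idx:
        taken = fraction * attn[i]
        out[i] -= taken
        budget += taken
    for j in target_idx:
        out[j] += budget * attn[j] / target_mass
    return out
```

In use, `hidden_states` would be the per-token hidden vectors of one layer, `sink_idx` the visual tokens flagged by `find_sink_tokens`, and `target_idx` the remaining visual tokens. Which heads and layers to apply this to is left open here.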