ICLR 2026

Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei


Abstract

Vision-language models (VLMs) suffer from significant computational inefficiency because they generate an excessive number of visual tokens. While prior work shows that a large fraction of visual tokens is redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose $\textbf{PruneSID}$, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA), which clusters tokens into semantically coherent groups to ensure comprehensive concept coverage, and (2) intra-group Non-Maximum Suppression (NMS), which prunes redundant tokens while preserving key representative tokens within each group. Additionally, $\textbf{PruneSID}$ incorporates an information-aware dynamic compression ratio mechanism that adapts the token compression rate to image complexity, enabling more effective information preservation on average across diverse scenes. Extensive experiments demonstrate state-of-the-art performance: $\textbf{96.3}$% accuracy on LLaVA-1.5 with only $\textbf{11.1}$% token retention, and $\textbf{92.8}$% accuracy at an extreme compression rate ($\textbf{5.6}$%) on LLaVA-NeXT, outperforming prior methods by $\textbf{2.5}$% with $\textbf{7.8}\times$ faster prefilling than the original model. Our framework generalizes across diverse VLMs and across both image and video modalities, showcasing strong cross-modal versatility.
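To make the two-stage idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that grouping can be approximated by assigning each token to the principal direction on which it has the largest absolute projection, that this projection magnitude serves as an importance score, and that redundancy within a group is measured by cosine similarity. The function names, the threshold `sim_threshold`, the equal per-group budget, and the random input features are all hypothetical choices made for illustration. The dynamic compression ratio mechanism is omitted.

```python
import numpy as np


def group_tokens_by_principal_components(tokens, num_groups):
    """Assign each token to the principal direction it projects onto most strongly.

    Assumption: this PCA-style grouping stands in for the paper's PSCA step.
    Requires num_groups <= min(tokens.shape).
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions in feature space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = np.abs(centered @ vt[:num_groups].T)  # shape (N, num_groups)
    # Group id per token, plus the projection magnitude used as a score.
    return loadings.argmax(axis=1), loadings.max(axis=1)


def intra_group_nms(tokens, scores, sim_threshold, max_keep):
    """Greedy NMS: keep the best-scoring token, suppress its near-duplicates."""
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    suppressed = np.zeros(len(tokens), dtype=bool)
    kept = []
    for i in np.argsort(-scores):  # highest score first
        if suppressed[i]:
            continue
        kept.append(int(i))
        if len(kept) == max_keep:
            break
        # Suppress every token too similar to the one just kept.
        suppressed |= (normed @ normed[i]) > sim_threshold
    return kept


def prune_tokens(tokens, budget, num_groups=8, sim_threshold=0.8):
    """Return sorted indices of retained tokens (at most `budget` of them)."""
    group_ids, scores = group_tokens_by_principal_components(tokens, num_groups)
    per_group = max(1, budget // num_groups)  # hypothetical equal split
    keep = []
    for g in range(num_groups):
        members = np.flatnonzero(group_ids == g)
        if members.size == 0:
            continue
        local = intra_group_nms(
            tokens[members], scores[members], sim_threshold, per_group
        )
        keep.extend(members[local].tolist())
    return np.array(sorted(keep[:budget]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative input: 576 random vectors standing in for visual tokens.
    visual_tokens = rng.standard_normal((576, 32))
    kept = prune_tokens(visual_tokens, budget=64)
    print(f"kept {len(kept)} of {len(visual_tokens)} tokens")
```

In this sketch a budget of 64 out of 576 tokens corresponds to the 11.1% retention setting quoted above. The sketch may return fewer tokens than the budget when a group is small or highly redundant, since it does not redistribute unused slots to other groups.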