AAAI 2026

Less Is More: Vision Representation Compression for Efficient Video Generation with Large Language Models

Yucheng Zhou, Jihai Zhang, Guanjie Chen, Jianbing Shen, Yu Cheng

Abstract

Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands due to their long visual token sequences and massive parameter counts. To address these issues, recent works have proposed training-free compression methods. However, existing efforts suffer from three major limitations: (1) techniques are not decomposed into comparable modules, hindering fair evaluation across spatial and temporal redundancy; (2) evaluation is confined to simple single-turn tasks, failing to reflect performance in realistic scenarios; and (3) individual compression techniques are used in isolation, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) spatial and temporal redundancies demand distinct technical strategies; (2) token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks; and (3) combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research on efficient VLMs. Our code is available at https://github.com/ModelTC/LightCompress.

Introduction

Recently, Large Language Models (LLMs) (Touvron et al. 2023; Liu et al. 2024a; Brown et al. 2020) have advanced rapidly in Natural Language Processing (NLP), marking a significant milestone in the AI revolution. This breakthrough has quickly extended to vision modalities: mainstream Vision-Language Models (VLMs) (Liu et al. 2023, 2024b; Wang et al. 2024a; Chen et al. 2024c) typically encode visual inputs into tokens and unify multiple modalities within a shared embedding space, demonstrating strong visual-language understanding and generation capabilities across various tasks (Singh et al. 2019; Antol et al. 2015; Hudson and Manning 2019).
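To make the cost of long visual token sequences concrete, the following minimal sketch illustrates one generic form of training-free token reduction: keeping only the highest-scoring visual tokens before they reach the language model. It is not the LLMC+ toolkit's API. The function names `prune_visual_tokens` and `attention_cost`, the use of a per-token importance score, and the token counts in the usage example are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def prune_visual_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float,
) -> List[int]:
    """Return indices of the visual tokens to keep, in original order.

    tokens: one embedding vector per visual token (hypothetical input).
    scores: one importance score per token, e.g. the attention it receives
            (the scoring rule is an assumption, not taken from the paper).
    keep_ratio: fraction of tokens to retain, in (0, 1].
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Keep at least one token so the language model still sees the image.
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by descending score; ties keep the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:num_keep])


def attention_cost(num_text: int, num_visual: int) -> int:
    """Pairwise self-attention interactions in one layer (quadratic in length)."""
    total = num_text + num_visual
    return total * total


if __name__ == "__main__":
    toy_tokens = [[0.1], [0.2], [0.3], [0.4]]
    toy_scores = [0.9, 0.1, 0.5, 0.7]
    print(prune_visual_tokens(toy_tokens, toy_scores, keep_ratio=0.5))  # [0, 3]
    # Illustrative sizes only: 32 text tokens, 576 vs. 144 visual tokens.
    print(attention_cost(32, 576), attention_cost(32, 144))  # 369664 30976
```

Under these illustrative sizes, retaining a quarter of the visual tokens cuts the per-layer pairwise attention count by roughly a factor of twelve. This is why token-level compression is attractive, and also why dropping the wrong tokens can hurt detail-sensitive tasks.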