CVPR 2025

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang

Abstract

We introduce DyCoke (dynamic compression of tokens), a training-free token compression method for fast video large language models. Its key innovation over its predecessors is to remove redundant tokens dynamically during the decoding stage, reducing both the temporal (across video frames) and spatial redundancy in visual tokens. The accompanying figure (right panel) compares the efficiency and performance of various training-free token pruning methods on MVBench [23] with LLaVA-OV-7B [18]. DyCoke surpasses the state-of-the-art counterparts (PruMerge [39], FastV [3]), achieving a 1.5× inference speedup and a 1.4× reduction in memory usage relative to the baseline while also improving performance.
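To make the notion of temporal redundancy in visual tokens concrete, the following is a minimal sketch and not DyCoke's actual algorithm: the abstract does not specify the pruning criterion. It assumes that a token is redundant when its cosine similarity to the token at the same spatial position in the previous frame exceeds a threshold. The function names `cosine` and `prune_temporal`, the threshold of 0.9, and the toy two-dimensional tokens are all hypothetical choices made for illustration, and the sketch does not model the decoding-stage behavior described above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_temporal(frames, threshold=0.9):
    """Return the (frame_index, token_index) pairs that survive pruning.

    Assumed criterion (not taken from the paper): a token is dropped when it
    is nearly identical to the token at the same position in the previous
    frame. Tokens of the first frame are always kept.
    """
    kept = []
    for t, frame in enumerate(frames):
        for i, token in enumerate(frame):
            if t > 0 and cosine(token, frames[t - 1][i]) > threshold:
                continue  # temporally redundant under the assumed criterion
            kept.append((t, i))
    return kept


# Two frames with two tokens each; token 0 barely changes, token 1 changes a lot.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.01], [1.0, 0.0]],
]
kept = prune_temporal(frames)
print(kept)
```

In this toy input, token 0 of the second frame is almost unchanged from the first frame and is dropped, while token 1 changes direction completely and is kept. A method that also targets spatial redundancy would need a further criterion within each frame, which this sketch leaves out.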