CVPR2025
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, Qixiang Ye
Abstract
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and ( 2 ) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Our codes are available at https://github.com/ncTimTang/AKS * Equal contribution. A: The panda is eating food. A: The panda is rolling down. Q: What does the panda do in the video? Uniform Sampling Adaptive Keyframe Sampling MLLM MLLM Q: After entering the museum, where did the short-haired woman in a black coat, black long skirt, and black mask visit first? Q: A white-lettered title says 'How much data do we need?'. The word 'dog' appears on the far right, When the image of a shaking dog appears, what changes occur in the black scene? Q: Two men wearing straw hats and grey clothes stand in a grass field holding long knives. What does the house behind them look like?