ICLR 2025

Understanding Long Videos with Multimodal Language Models

Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

Abstract

Large Language Models (LLMs) have enabled recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs contribute to this performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks further establishes its generality.

Code: github.com/kahnchana/mvu

[Figure: panels "Just LLM", "Single Frame VLM", and "Multimodal Video Understanding (MVU)"; input labels: Question, Candidates, Center Frame, Selected Frames, Video.]
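The idea of using natural language as the fusion medium can be illustrated with a minimal sketch. It assumes each extracted modality has already been rendered as a short text description. The function name `build_prompt`, the modality labels, and the prompt layout are hypothetical and are not taken from the MVU code; the call to an LLM is omitted.

```python
from typing import Dict, List


def build_prompt(question: str, candidates: List[str], video_info: Dict[str, str]) -> str:
    """Fuse text descriptions of a video with a multiple-choice question.

    video_info maps a modality label to its text description (hypothetical
    format). An empty dict gives a prompt with the question and candidates only.
    """
    lines = []
    # Each modality contributes one line of plain text to the prompt.
    for label, text in video_info.items():
        lines.append(f"{label}: {text}")
    lines.append(f"Question: {question}")
    # Candidates are lettered (A), (B), ... in the order given.
    for idx, cand in enumerate(candidates):
        lines.append(f"({chr(ord('A') + idx)}) {cand}")
    lines.append("Answer with the letter of the best candidate.")
    return "\n".join(lines)


# Illustrative inputs; the labels and descriptions are made up for this sketch.
example_info = {
    "Objects present": "cup, kettle, hand",
    "Object locations": "cup at center; kettle at left",
    "Object motion": "kettle moves from left to center",
}
example_prompt = build_prompt(
    "What is the person most likely doing?",
    ["Pouring water", "Washing dishes"],
    example_info,
)
```

Passing an empty `video_info` dictionary yields a prompt containing only the question and candidates, which corresponds to the setting with no video-specific information.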