ICLR 2026

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li


Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling [...]

Figure 1: Key insight of Video-STAR. (a) MLLMs + CoT is prone to hallucinations because it over-relies on text-centric reasoning and ignores visual cues. (b) MLLMs + Tool-Augmented CoT mitigates hallucinations by integrating domain-specific tools to extract visual information. However, both (a) and (b) lack category-specific reasoning capabilities and struggle to distinguish semantically similar or complex actions. (c) Video-STAR enhances reasoning capacity by introducing contextual sub-motion decomposition, which disentangles actions into discriminative motion primitives. This enables fine-grained action discrimination and robust performance in open-vocabulary scenarios.
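
The abstract names three ingredients of the hierarchical reward (tool-usage efficiency, sub-motion relevance, and structural coherence) but does not give its form. The sketch below is one illustrative way such a reward might be layered, not the authors' formulation. The `Rollout` fields, the weights, the `max_tool_calls` cap, and the `reference_sub_motions` list are all hypothetical; the abstract says no explicit sub-motion supervision is used, so the reference list here is only a stand-in for whatever relevance signal the method actually relies on.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One sampled reasoning trace (all fields are illustrative assumptions)."""
    well_formed: bool            # trace follows the expected reasoning structure
    correct: bool                # predicted action label matches the ground truth
    tool_calls: int              # number of tool invocations in the trace
    sub_motions: list = field(default_factory=list)  # sub-motions named in the trace


def hierarchical_reward(rollout, reference_sub_motions, max_tool_calls=3,
                        w_format=0.2, w_motion=0.5, w_tool=0.3):
    """Toy layered reward; weights and gating order are assumptions."""
    # Level 1: structural coherence gates every other term.
    if not rollout.well_formed:
        return 0.0
    reward = w_format

    # Level 2: sub-motion relevance, the fraction of named sub-motions
    # that also appear in the (hypothetical) reference list.
    named = set(rollout.sub_motions)
    if named:
        hits = len(named & set(reference_sub_motions))
        reward += w_motion * hits / len(named)

    # Level 3: tool-usage efficiency, granted only for correct answers
    # and decreasing as more tool calls are spent.
    if rollout.correct and 0 < rollout.tool_calls <= max_tool_calls:
        reward += w_tool * (1 - (rollout.tool_calls - 1) / max_tool_calls)
    return reward
```

Gating on structure first means a malformed trace earns nothing regardless of its answer, and tying the tool bonus to correctness discourages invoking tools without benefit. Both choices are assumptions made for this sketch.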