AAAI 2026
VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning
Siran Chen, Boyu Chen, Yuxiao Luo, Chenyun Yu, Yi Ouyang, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang
Abstract
Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark have demonstrated the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.

Preprint. Under review.

[Figure: example outputs of a general MLLM, an SFT baseline, the IP Agent, and the US Agent (Ours) on two videos.]

Example 1
Prompt: This is a video titled "in the middle of the night my pig addiction again committed # filial police Art # food # drama", will the user like the video?
MLLM: a man wearing a black hoodie with yellow letter…, maybe related to Chinese content or culture.
IP Agent: a humorous and exaggerated drama shows three men in black are seeking and arresting people with pig addiction .... #fictional drama #humorous
SFT: <answer> No </answer>.
US Agent (Ours): <think> based on the user's historic…, shows a positive attitude…may like pleasant, relaxing, humorous content </think> <answer> Yes </answer>.

Example 2
Prompt: This is a video titled "The situation suddenly changed", will the user like the video?
MLLM: two men wearing suit, the old man in the center is smiling and seems friendly…
SFT: <answer> Yes </answer>.
US Agent (Ours): <think> based on the user's historic…, the user loves games, sports and TV shows, not shows interest in politic topics </think> <answer> No </answer>.
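The example outputs above use a <think> … </think> <answer> Yes/No </answer> layout. As a minimal sketch of how such responses might be scored for user decision simulation, the Python snippet below extracts the decision and computes accuracy against ground-truth labels. The function names (parse_response, decision_accuracy) and the choice to count malformed responses as wrong are assumptions made for illustration; they are not taken from the paper.

```python
import re

# Hypothetical helpers: the tag layout mirrors the example outputs shown above.
# The parsing and scoring rules are illustrative assumptions, not the paper's code.
ANSWER_RE = re.compile(r"<answer>\s*(yes|no)\s*</answer>", re.IGNORECASE)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_response(text):
    """Return (reasoning, decision).

    reasoning is the stripped <think> content, or None if absent.
    decision is True for "Yes", False for "No", or None if no answer tag parses.
    """
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    reasoning = think.group(1).strip() if think else None
    decision = None if answer is None else answer.group(1).lower() == "yes"
    return reasoning, decision


def decision_accuracy(responses, labels):
    """Fraction of responses whose parsed decision matches the label.

    Assumption: a response without a parsable answer counts as wrong.
    """
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    if not responses:
        return 0.0
    correct = sum(parse_response(r)[1] == y for r, y in zip(responses, labels))
    return correct / len(responses)
```

For instance, parse_response applied to "<think> likes humorous content </think> <answer> Yes </answer>." yields the reasoning string "likes humorous content" together with the decision True.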