ICLR 2026

VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao PENG, Shu Liu, Bei Yu, Jiaya Jia

Cited by 8

Abstract

Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning about and solving multiple visual perception tasks within a shared model. Specifically, by designing a unified reward mechanism and multi-object cognitive learning strategies, VisionReasoner enhances its reasoning capabilities to analyze visual inputs and addresses diverse perception tasks within a unified model. VisionReasoner generates a structured reasoning process before delivering the desired outputs in response to user queries. Human evaluation reveals that the reasoning process of VisionReasoner is faithful and reliable even without annotated reasoning training data. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming the baseline Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 13.2% on CountBench (counting).

INTRODUCTION

Recent advances in large vision-language models (LVLMs) (Bai et al., 2025; Wang et al., 2024; Google, 2025; OpenAI, 2025) have demonstrated remarkable capabilities in visual conversations. As the field progresses, researchers are increasingly applying LVLMs to a wider range of visual perception tasks, such as visual grounding (Peng et al., 2024) and reasoning segmentation (Lai et al., 2024; Liu et al., 2025a), often incorporating task-specific modules or techniques.
Through an analysis of diverse visual perception tasks, we observe that many can be categorized into three fundamental types: detection (e.g., object detection (Lin et al., 2014), visual grounding (Yu et al., 2016)), segmentation (e.g., referring expression segmentation (Yu et al., 2016), reasoning segmentation (Lai et al., 2024)), and counting (e.g., object counting (Paiss et al., 2023)). Notably, our analysis reveals that these three task types share a common structure as multi-object cognition problems, suggesting that they can be addressed through a unified framework.

Moreover, recent studies have explored the integration of reinforcement learning (RL) into LVLMs (Team, 2025; Liu et al., 2025b;a; Zheng et al., 2025). Works such as VisualRFT (Liu et al., 2025b) and Seg-Zero (Liu et al., 2025a) demonstrate that RL can enhance reasoning in visual perception tasks. However, these approaches often employ RL in a task-specific manner, training with different data for different tasks, which may limit their scalability and generalizability.

Building on these insights, we propose VisionReasoner, a unified framework that addresses diverse visual perception tasks through a shared architecture. The framework's core capabilities, which include advanced reasoning and multi-object cognition, are enabled through RL and a unified reward mechanism. The format rewards include thinking rewards that promote structured reasoning and non-repeat rewards that prevent redundant reasoning patterns. The accuracy rewards, comprising multi-object IoU rewards and L1 rewards for precise localization, strengthen multi-object cognition. Unlike previous approaches such as Kosmos (Peng et al., 2024), which use a cross-entropy loss, our RL framework requires optimal matching between predictions and ground truths.
We address this challenge with an efficient matching pipeline that combines batched computation with the Hungarian algorithm, significantly improving computational efficiency while maintaining matching accuracy.
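
To make the matching step concrete, the following is a minimal sketch of how predicted boxes might be matched one-to-one to ground-truth boxes before an IoU-based reward is computed. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pairwise_iou`, `match_boxes`, `multi_object_iou_reward`), the `[x1, y1, x2, y2]` box format, the 0.5 IoU threshold, and the normalization by the larger of the two box counts are all hypothetical choices. The assignment is solved with SciPy's `linear_sum_assignment`, which solves the same linear assignment problem that the Hungarian algorithm addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_iou(pred, gt):
    """IoU between every predicted and ground-truth box, computed in one batch.

    Boxes are assumed to be [x1, y1, x2, y2]; pred has shape (N, 4), gt (M, 4).
    Returns an (N, M) matrix.
    """
    pred = np.asarray(pred, dtype=float).reshape(-1, 4)
    gt = np.asarray(gt, dtype=float).reshape(-1, 4)
    # Broadcasting yields all N x M intersections without a Python loop.
    top_left = np.maximum(pred[:, None, :2], gt[None, :, :2])
    bottom_right = np.minimum(pred[:, None, 2:], gt[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_pred[:, None] + area_gt[None, :] - inter
    return inter / np.maximum(union, 1e-9)


def match_boxes(pred, gt):
    """One-to-one matching that maximizes the total IoU of matched pairs."""
    iou = pairwise_iou(pred, gt)
    rows, cols = linear_sum_assignment(iou, maximize=True)
    return list(zip(rows.tolist(), cols.tolist())), iou


def multi_object_iou_reward(pred, gt, threshold=0.5):
    """Hypothetical reward: fraction of objects matched above an IoU threshold.

    Dividing by max(N, M) penalizes both missed and spurious boxes; this
    normalization is an assumption made for illustration.
    """
    if len(pred) == 0 or len(gt) == 0:
        return 0.0
    pairs, iou = match_boxes(pred, gt)
    hits = sum(1 for i, j in pairs if iou[i, j] > threshold)
    return hits / max(len(pred), len(gt))


pred = [[0, 0, 10, 10], [20, 20, 30, 30]]
gt = [[21, 21, 30, 30], [0, 0, 10, 10], [50, 50, 60, 60]]
pairs, iou = match_boxes(pred, gt)
reward = multi_object_iou_reward(pred, gt)
```

In this example the two predictions are matched to the first two ground-truth boxes regardless of their order, and the third ground-truth box stays unmatched, which lowers the reward. Computing the full IoU matrix in one batched operation and then solving a single assignment problem keeps the cost of the reward computation low when it is evaluated many times during RL training.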