ICLR 2026
SAM-Veteran: An MLLM-Based Human-like SAM Agent for Reasoning Segmentation
Tianyuan Du, Haopeng Li, Zhen Fan, Jiarui Zhang, Panwang Pan, Yang Zhang
Abstract
Significant progress has been made in reasoning segmentation by combining multi-modal large language models (MLLMs) with the Segment Anything Model (SAM): the former excel at reasoning and vision-language alignment, while the latter offers powerful pixel-level understanding. However, current paradigms fall short of exploiting SAM's strengths, in particular its support for iterative mask refinement through interactive segmentation, a process that human users perform naturally. To bridge this gap, we introduce SAM-Veteran, an experienced, mask-aware SAM agent that emulates human interaction with SAM via a reasoning-driven segmentation workflow that integrates (i) generating bounding boxes from image-query pairs as SAM input, (ii) proposing refinement points based on SAM-generated masks, and (iii) adaptively terminating the process. To this end, we propose a multi-task reinforcement learning framework based on Group Relative Policy Optimization (GRPO), which enhances the MLLM's abilities in textual grounding and mask comprehension. Furthermore, we introduce a dynamic sampling strategy tailored to generating both boxes and points, which stabilizes training. Extensive experiments across diverse datasets show that SAM-Veteran achieves human-like interaction with SAM and establishes new state-of-the-art performance on both in-domain and out-of-domain benchmarks.
Recent studies have investigated two primary MLLM-based paradigms for this task: (1) Supervised Fine-Tuning (SFT), where MLLMs generate special tokens that control a learnable segmentation head or decoder, thereby enabling end-to-end training as a unified model (Yan et al., 2024; Lai et al., 2024; Yan et al., 2025); and (2) Reinforcement Learning (RL), where MLLMs are optimized with reward signals to generate boxes and/or points that are then fed into the Segment Anything Model (SAM) (Kirillov et al., 2023) to produce the final segmentation (Liu et al., 2025b; Huang et al., 2025).
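The three-step workflow named in the abstract (box generation, point-based refinement, adaptive termination) can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: every name below (`Prompts`, `run_agent`, `toy_segment`, `make_toy_policy`, `max_rounds`) is hypothetical, the toy segmenter merely stands in for SAM, and the toy policy stands in for the MLLM by comparing against a known target mask, which a real agent would not have.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), inclusive; assumed format
Point = Tuple[int, int, int]      # (x, y, label); label 1 = positive, 0 = negative
Mask = List[List[int]]            # binary mask indexed as mask[y][x]


@dataclass
class Prompts:
    """Prompts accumulated over one interaction episode (hypothetical container)."""
    box: Box
    points: List[Point] = field(default_factory=list)


def run_agent(
    propose_box: Callable[[], Box],
    propose_point: Callable[[Mask], Optional[Point]],
    segment: Callable[[Prompts], Mask],
    max_rounds: int = 5,
) -> Tuple[Mask, int]:
    """Start from a box, then add refinement points until the policy stops."""
    prompts = Prompts(box=propose_box())      # step (i): box from the image-query pair
    mask = segment(prompts)
    rounds = 0
    while rounds < max_rounds:
        point = propose_point(mask)           # step (ii): inspect the current mask
        if point is None:                     # step (iii): policy decides to terminate
            break
        prompts.points.append(point)
        mask = segment(prompts)
        rounds += 1
    return mask, rounds


def toy_segment(prompts: Prompts, size: int = 4) -> Mask:
    """Stand-in for SAM: fill the box, then let each point overwrite its own cell."""
    x0, y0, x1, y1 = prompts.box
    mask = [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(size)]
            for y in range(size)]
    for x, y, label in prompts.points:
        mask[y][x] = label
    return mask


def make_toy_policy(target: Mask) -> Callable[[Mask], Optional[Point]]:
    """Stand-in for the MLLM: click the first cell that disagrees with the target."""
    def propose_point(mask: Mask) -> Optional[Point]:
        for y, row in enumerate(target):
            for x, want in enumerate(row):
                if mask[y][x] != want:
                    return (x, y, want)
        return None                           # nothing left to fix: stop
    return propose_point


# Target object: a 3x3 square with one corner cell missing.
target = [[1, 1, 1, 0],
          [1, 1, 1, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0]]

final_mask, rounds = run_agent(
    propose_box=lambda: (0, 0, 2, 2),
    propose_point=make_toy_policy(target),
    segment=toy_segment,
)
```

In this toy run the box alone over-segments one corner cell, a single negative point removes it, and the policy then stops. Passing the box proposer, point proposer, and segmenter as callables keeps the loop independent of any particular MLLM or SAM interface.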
* Equal contribution. † Corresponding Author.
While SFT-based methods effectively incorporate the reasoning capability of MLLMs into