ICLR 2026
OrchestrationBench: LLM-Driven Agentic Planning and Tool Use in Multi-Domain Scenarios
Aelim Ahn, Sooyeon Lee, Hyosun Wang, Chiwan Park, Daeryong Kim, Jihyeon Roh, Kichang Yang, Wonjun Jang, Hwang Woosung, Min Seok Kim, Jihoon Kang
Abstract
Recent progress in Large Language Models (LLMs) has transformed them from text generators into agentic systems capable of multi-step reasoning, structured planning, and tool use. However, existing benchmarks inadequately capture their ability to orchestrate complex workflows across multiple domains under realistic constraints. To address this, we propose OrchestrationBench, a bilingual (English/Korean) benchmark that systematically evaluates (1) workflow-based planning and (2) constraint-aware tool execution. OrchestrationBench spans 17 representative domains with nearly 100 realistic virtual tools, covering scenarios that require sequential and parallel planning as well as compliance with business constraints. Unlike previous work, it explicitly disentangles planning evaluation from tool execution evaluation, with the latter assessing tool selection, argument extraction, validation, and rejection handling. Constructed entirely through manual annotation with cultural adaptation, the benchmark ensures authenticity, diversity, and freedom from model-specific biases. Extensive experiments across state-of-the-art models show that function calling performance is relatively consistent, whereas planning capabilities vary substantially across models, emphasizing the need for structured planning evaluation. As a living benchmark, OrchestrationBench is designed to expand toward new domains, tools, and integrations, enabling rigorous, cross-cultural, and service-ready evaluation of LLM orchestration capabilities. The benchmark is publicly available.
Introduction
Large Language Models (LLMs) have advanced rapidly in recent years (OpenAI, 2022; 2023; DeepMind, 2025a;b; Anthropic, 2025). Although initially regarded primarily as powerful text generators, recent research has demonstrated their capacity to operate as versatile agents that can interact with external tools (Yao et al., 2023), perform multi-step reasoning over complex instructions, and assist users in various real-world applications (Shi et al., 2024). This evolution signifies a paradigm shift: from passive text generation toward the active orchestration of tasks, positioning LLMs as potential service-ready agents in both consumer-facing and enterprise domains.
Despite this progress, substantial challenges remain for real-world deployment. In practice, user requests often involve sequences of interdependent subtasks that must be coordinated effectively (Huang et al., 2024; Yao et al., 2024). These tasks frequently span heterogeneous domains, require integration with external systems, and must adapt to dynamic constraints that evolve during user interaction. However, existing benchmarks operate largely in simplified or domain-isolated settings and thus do not capture the orchestration capabilities required for service-ready LLMs (Zhong et al., 2025; Mialon et al., 2023).