ICLR 2026

Benchmarking LLM Tool-Use in the Wild

Peijie Yu, Wei Liu, Yifan Yang, Jinjian Li, Zelong Zhang, Xiao Feng, Feng Zhang

Cited by 2

Abstract

Fulfilling user needs through multi-turn, multi-step tool-use by Large Language Models (LLMs) is rarely a straightforward process. Real user interactions are inherently wild: intricate, messy, and flexible. We identify three key challenges arising from user behavior: compositional tasks, which demand efficient orchestration of tool-call topologies; implicit intent spread across dialogue turns, which requires contextual inference; and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy above 15%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among LLMs, users, and tools.