ICML 2025

The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models

Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, Joseph E. Gonzalez

Abstract

Function calling, also called tool use, refers to an LLM's ability to invoke external functions, APIs, or user-defined tools, an essential capability for agentic LLM applications. Despite its prominence, no standard benchmark exists for evaluating function calling, for two reasons: the difficulty of determining when a function call is valid, and the difficulty of acquiring diverse, real-world functions. We present the Berkeley Function Calling Leaderboard (BFCL), a comprehensive benchmark designed to evaluate function calling in a wide range of real-world settings. BFCL evaluates serial and parallel function calls across various programming languages, using a novel Abstract Syntax Tree (AST) evaluation method that can easily scale to thousands of functions. We construct the benchmark using a combination of expert-curated and user-contributed functions and associated prompts. Finally, BFCL evaluates the ability of models to abstain and reason in a stateful, multi-step agentic setting. Evaluating a wide range of models, we observe that while state-of-the-art LLMs excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain open challenges. Since its preview, BFCL has become the de facto standard for evaluating function calling, and can be accessed at https://gorilla.cs.berkeley.edu/leaderboard.html.