ICML 2024

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar

Cited by 163

Abstract

A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs to not just generate probable completions for a given prompt, but rather make intelligent decisions over an extended period of multi-turn interaction to accomplish a task (e.g., when interacting with the web, using software tools, or engaging in customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on single-turn reward maximization. By construction, today's single-turn RL methods cannot train LLMs to intelligently seek and incorporate information over multiple turns, perform credit assignment, or reason about their past actions, all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we propose an algorithmic framework for developing multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while accommodating multiple turns, long horizons, and delayed rewards effectively. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy RL algorithm that trains a value function to aggregate reward over utterances, and a low-level RL algorithm that utilizes this high-level value function (in place of the reward model used in single-turn RL) to train a token-by-token policy within each utterance or turn. This hierarchical approach prescribed by our framework, Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to a number of other RL approaches.
Empirically, we find that ArCHer significantly improves efficiency and performance on multi-turn tasks, attaining an improvement in sample efficiency of about 100x over existing on-policy methods, while also benefiting from scaling up model capacity (up to the 7 billion parameter scale, the largest we could test in our experiments). The project page is at https://yifeizhou02.github.io/archer.io/ and the code is at https://github.com/YifeiZhou02/ArCHer.
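
To make the two-level structure described in the abstract concrete, here is a minimal tabular sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The toy environment, the binary vocabulary, the REINFORCE-style token update, the exact enumeration used for V(s), and every name (`env_step`, `TokenPolicy`, `state_value`, `train`) are hypothetical choices made for this example. A high-level critic learns utterance-level Q-values from a replay buffer (off-policy), and a low-level token policy is updated with the critic's advantage in place of a reward-model score.

```python
import itertools
import math
import random
from collections import defaultdict

# Toy setup (all assumptions): binary tokens, 2 tokens per utterance, 2 turns.
VOCAB = (0, 1)
TOKENS_PER_TURN = 2
NUM_TURNS = 2
GOOD = (1, 1)  # the utterance the toy task wants at every turn
UTTERANCES = list(itertools.product(VOCAB, repeat=TOKENS_PER_TURN))


def env_step(state, utterance):
    """State is (turn, all_good_so_far); reward arrives only at the last turn."""
    turn, ok = state
    ok = ok and utterance == GOOD
    done = turn + 1 == NUM_TURNS
    return (turn + 1, ok), (1.0 if done and ok else 0.0), done


class TokenPolicy:
    """Low-level policy: tabular softmax over tokens given (state, prefix)."""

    def __init__(self):
        self.logits = defaultdict(lambda: [0.0] * len(VOCAB))

    def token_probs(self, state, prefix):
        logits = self.logits[(state, prefix)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def utterance_prob(self, state, utterance):
        p = 1.0
        for i, tok in enumerate(utterance):
            p *= self.token_probs(state, utterance[:i])[tok]
        return p

    def sample(self, state, rng):
        prefix = ()
        for _ in range(TOKENS_PER_TURN):
            probs = self.token_probs(state, prefix)
            prefix += (rng.choices(VOCAB, weights=probs)[0],)
        return prefix

    def policy_gradient_step(self, state, utterance, advantage, lr):
        # REINFORCE per token; d log softmax / d logit_v = 1[v == tok] - p_v.
        for i, tok in enumerate(utterance):
            key = (state, utterance[:i])
            probs = self.token_probs(state, utterance[:i])
            for v in VOCAB:
                self.logits[key][v] += lr * advantage * ((v == tok) - probs[v])


def state_value(policy, q, state):
    """V(s) = E_{u ~ pi}[Q(s, u)], computed exactly by enumerating utterances."""
    return sum(policy.utterance_prob(state, u) * q[(state, u)] for u in UTTERANCES)


def train(num_episodes=3000, gamma=0.95, critic_lr=0.5, actor_lr=0.5, seed=0):
    rng = random.Random(seed)
    policy, q, replay = TokenPolicy(), defaultdict(float), []
    for _ in range(num_episodes):
        # Collect one multi-turn episode with the current policy.
        episode, state, done = [], (0, True), False
        while not done:
            u = policy.sample(state, rng)
            nxt, reward, done = env_step(state, u)
            episode.append((state, u, reward, nxt, done))
            state = nxt
        replay.extend(episode)
        # High level: off-policy TD(0) on replayed utterance-level transitions.
        for s, u, r, s2, d in rng.sample(replay, min(16, len(replay))):
            target = r + (0.0 if d else gamma * state_value(policy, q, s2))
            q[(s, u)] += critic_lr * (target - q[(s, u)])
        # Low level: the critic's advantage replaces a reward-model score.
        for s, u, _, _, _ in episode:
            adv = q[(s, u)] - state_value(policy, q, s)
            policy.policy_gradient_step(s, u, adv, actor_lr)
    return policy, q
```

Calling `policy, q = train()` and then inspecting `policy.utterance_prob((0, True), GOOD)` shows how much probability the token policy places on the rewarded utterance at the first turn, even though the only reward is delayed to the final turn. In this sketch the credit reaches the first turn solely through the utterance-level critic. A practical system would replace the tables with neural networks and could swap the REINFORCE step for another single-turn method such as proximal policy optimization.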