ACL2024

Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

Abstract

Large language models (LLMs) have emerged as the core controller for various autonomous agent systems. In this work, we introduce ETO, a method aimed at enhancing the capabilities of open-source LLM agents. Unlike previous work that trains solely on successful expert trajectories, our approach enables agents to learn from their exploration failures, leading to improved performance through an iterative exploration-training framework. During the exploration phase, the agent explores the environment, collecting failure trajectories to construct contrastive trajectory pairs. In the training phase, the agent leverages the trajectory contrastive information to update its policy. This iterative process of exploration and training facilitates further improvement for the agents. Experiments on three agent datasets show our method consistently outperforms baselines by more than 5% in final rewards. Moreover, analysis of task-solving efficiency and the potential in scenarios without expert trajectories further highlights the effectiveness of our method.

[…]pectations (Zhou et al., 2023; Mialon et al., 2023).

To enhance the capabilities of LLM agents, one effective approach is imitation learning. For example, behavioral cloning (BC) (Pomerleau, 1991) offers a straightforward method to acquire a policy by supervised learning on observation-action pairs from gold expert trajectories. Recently, there have been attempts (Chen et al., 2023; Zeng et al., 2023; Yin et al., 2023) to apply BC to open-source LLM-based agents by directly performing supervised fine-tuning (SFT) on expert trajectories. Taking a step further, Aksitov et al. (2023) refine the agent through iterative BC on successful trajectories generated by the previous policy.

Existing research primarily concentrates on imitation learning from successful expert trajectories.
However, relying solely on expert demonstrations […]

The agent task with environment feedback can be formalized as a partially observable Markov decision process (POMDP) $(\mathcal{U}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})$ with instruction space $\mathcal{U}$, state space $\mathcal{S}$, action space $\mathcal{A}$, observation space $\mathcal{O}$, transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, and reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to [0, 1]$. Note that in our LLM-based agent scenario, $\mathcal{U}$, $\mathcal{A}$, and $\mathcal{O}$ are subsets of the natural language space.

Given a task instruction $u \in \mathcal{U}$, the LLM agent with parameters $\theta$ generates the first action $a_1 \sim \pi_\theta(\cdot \mid u) \in \mathcal{A}$ according to its policy $\pi_\theta$. The action incurs a change in the latent state $s_t \in \mathcal{S}$ and yields execution feedback as an observation $o_t \in \mathcal{O}$. Then the agent generates the corresponding action in the next step, $a_{t+1} \sim \pi_\theta(\cdot \mid u, a_1, o_1, \ldots, a_t, o_t) \in \mathcal{A}$. The interaction loop repeats until the task completes or the maximum number of steps is exceeded, and the trajectory is denoted as:

$$e = (u, a_1, o_1, \ldots, o_{n-1}, a_n) \tag{1}$$

where $n$ is the trajectory length. Finally, the final reward $r(u, e) \in [0, 1]$ is computed, with 1 representing successful task completion.

3 Method

Our method, ETO, starts by training a base agent through behavioral cloning. Based on the base agent, our framework continually enhances the policy from trial and error in an iterative manner.

3.1 Behavioral Cloning

Behavioral cloning (BC) has demonstrated promising results through supervised fine-tuning on expert interaction trajectory data, serving as a solid starting point for building a powerful agent. In this work, we employ ReAct-style (Yao et al., 2022b) trajectories to conduct BC, which additionally generate Chain-of-Thought (CoT) rationales (Wei et al., 2022) before each action. Considering that the CoT and action are generated together in the ReAct framework, we use $a$ to represent the action with CoT for simplicity.
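As an illustration of the interaction loop and the ReAct-style convention described above, the following minimal Python sketch rolls out one trajectory $e = (u, a_1, o_1, \ldots, o_{n-1}, a_n)$ and reads off a final reward. Everything in it is an assumption made for illustration and not part of the paper: the `ToyEnv` environment, the `scripted_policy` function standing in for $\pi_\theta$, the `Thought:`/`Action:` text layout, and the helper names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """e = (u, a_1, o_1, ..., o_{n-1}, a_n) plus the final reward r(u, e)."""
    instruction: str
    actions: List[str]        # each entry is CoT + action, treated as one "a"
    observations: List[str]   # one fewer than actions
    reward: float


class ToyEnv:
    """Hypothetical environment: the task is solved by taking a key, then opening a door."""

    def __init__(self) -> None:
        self.has_key = False   # latent state, never shown directly to the agent
        self.solved = False

    def step(self, action: str) -> Tuple[str, bool]:
        """Apply an action; return (observation, done)."""
        if action == "take key":
            self.has_key = True
            return "You pick up the key.", False
        if action == "open door" and self.has_key:
            self.solved = True
            return "The door opens.", True
        return "Nothing happens.", False

    def final_reward(self) -> float:
        return 1.0 if self.solved else 0.0


def extract_action(react_output: str) -> str:
    """Keep only the text after 'Action:'; the CoT rationale is not executed."""
    return react_output.split("Action:", 1)[-1].strip()


Policy = Callable[[str, List[str], List[str]], str]


def rollout(policy: Policy, env: ToyEnv, instruction: str, max_steps: int = 10) -> Trajectory:
    """Interact until the task completes or max_steps actions have been taken."""
    actions: List[str] = []
    observations: List[str] = []
    for _ in range(max_steps):
        a = policy(instruction, actions, observations)  # a_{t+1} given the history
        actions.append(a)
        obs, done = env.step(extract_action(a))
        if done or len(actions) == max_steps:
            break  # the observation after the last action is not part of e
        observations.append(obs)
    return Trajectory(instruction, actions, observations, env.final_reward())


def scripted_policy(instruction: str, actions: List[str], observations: List[str]) -> str:
    """Stand-in for an LLM policy: a fixed two-step ReAct-style plan."""
    if not actions:
        return "Thought: The door is probably locked, so I need the key.\nAction: take key"
    return "Thought: I have the key now.\nAction: open door"


traj = rollout(scripted_policy, ToyEnv(), "Open the door.")
print(len(traj.actions), len(traj.observations), traj.reward)
```

In this toy setting a policy that only ever emits `Action: open door` never finishes the task, so its rollout stops at `max_steps` with reward 0; such a rollout is the kind of failure trajectory that the exploration phase is described as collecting.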
Given an expert trajectory dataset $\mathcal{D} = \{(u, e)^{(i)}\}_{i=1}^{|\mathcal{D}|}$, where $|\mathcal{D}|$ is the number of trajectories, we fine-tune an LLM with an auto-regressive loss to get the base agent $\pi_{\text{base}}$:

$$\mathcal{L}_{\mathrm{SFT}}(\pi_\theta) = -\mathbb{E}_{e \sim \mathcal{D}}\left[\log \pi_\theta(e \mid u)\right]$$

where $e = (u, a_1, o_1, \ldots, o_{n-1}, a_n) \sim \mathcal{D}$ is an expert interaction trajectory.

[…] et al., 2017) is an RL method that directly optimizes the SFT agents to maximize the final task reward. We also include GPT-3.5-Turbo (OpenAI, 2022), GPT-4 (OpenAI, 2023), and untuned Llama-2-7B-Chat for comparison.

Evaluation. All methods are evaluated using the ReAct-style interaction format (Yao et al., 2022b), with the CoT rationale generated before the action. See Appendix C for the detailed prompts. We add a 1-shot in-context example to the instruction prompt for each task. The decoding temperature of the LLMs is set to 0.0 for deterministic generation, except for the Best-of-N method. We employ Average Reward as the metric, which is the average reward over all task instances in the test set.

4.2 Results

Table 2 presents the performance comparison of ETO and baselines on three agent datasets. As shown, ETO demonstrate