ACL 2024

Agent Lumos: Unified and Modular Training for Open-Source Language Agents

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Raghavi Chandu, Kai-Wei Chang, Yejin Choi, Bill Yuchen Lin

Abstract

Closed-source agents suffer from several issues such as a lack of affordability, transparency, and reproducibility, particularly on complex interactive tasks. This motivates the development of open-source alternatives. We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents. LUMOS features a learnable, unified, and modular architecture with a planning module that learns high-level subgoal generation, and a grounding module trained to translate these subgoals into actions using various tools in the execution module. The design allows for modular upgrades and wider applicability to diverse interactive tasks. To foster generalizable agent learning, we collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales across various complex interactive tasks. On 9 datasets, LUMOS exhibits several key advantages: (1) LUMOS outperforms multiple larger open-source agents on the held-out datasets (unused for training) for each task type, and even surpasses GPT agents on QA and web tasks; (2) LUMOS outperforms open-source agents produced by chain-of-thought and unmodularized integrated training; and (3) LUMOS effectively generalizes to unseen tasks, outperforming 33B-scale agents and domain-specific agents. Code and data will be released.

closed-source LLMs hinders scientific understanding of their architectures and effectiveness, and provides limited reproducibility and controllability over their behavior. We argue that over-reliance on closed-source LLM-based agents is not conducive to the growth of research on language agents.

In this paper, we propose LUMOS, a generalizable Language agent framework via Unified, Modular, and Open Source training. LUMOS employs a unified and modular architecture broadly applicable to complex interactive tasks: a planning module, a grounding module, and an execution module.
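To make the division of labor among the three modules concrete, the following is a minimal sketch, not the LUMOS implementation. The names `plan`, `ground`, `execute`, `Action`, and `TOOLS` are all hypothetical, and the rule-based bodies merely stand in for the trained planning and grounding models and for real tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    tool: str  # name of the tool the execution module should call
    arg: str   # single string argument, kept simple for this sketch


def plan(task: str) -> List[str]:
    """Toy stand-in for the planning module: task -> high-level subgoals."""
    # A trained planner would generate these; here they are hard-coded.
    return [f"Look up facts needed for: {task}", "Compose the final answer"]


def ground(subgoal: str) -> List[Action]:
    """Toy stand-in for the grounding module: subgoal -> executable actions."""
    if subgoal.startswith("Look up"):
        return [Action(tool="search", arg=subgoal)]
    return [Action(tool="answer", arg=subgoal)]


# Hypothetical tool registry standing in for the execution module.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"<results for '{query}'>",
    "answer": lambda prompt: f"<answer to '{prompt}'>",
}


def execute(action: Action) -> str:
    """Dispatch one grounded action to its tool and return the result."""
    return TOOLS[action.tool](action.arg)


def run_agent(task: str) -> List[str]:
    """Plan all subgoals, ground each into actions, then execute them in order."""
    results = []
    for subgoal in plan(task):
        for action in ground(subgoal):
            results.append(execute(action))
    return results
```

Calling `run_agent("Which country is Nintendo from?")` returns one result string per executed action. Because each stage is a separate function, any one of them could in principle be replaced (for example, by a stronger planner) without touching the others, which is the kind of modular upgrade the design is meant to allow.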
The planning module learns to decompose diverse complex tasks into a sequence of high-level subgoals. The grounding module is trained to communicate with the planning module

[Figure: Overview of the LUMOS modules (planning, grounding, execution) in two formulations: Lumos-OnePass (Lumos-O), which produces all subgoals and all actions at once, and Lumos-Iterative (Lumos-I), which at the t-th iteration produces the next subgoal and next actions given the task description T, action interfaces I, and previous subgoals, actions, and execution results. Multimodal task example (A-OKVQA): "The device in her hand is from which country?" Subgoal s1: "Identify the brand of the device …"; action a1: VQA(<img>, What is the brand..?); execution e1: LLAVA(...) ⇒ Nintendo. Subgoal s2: "Answer the country of Nintendo"; action a2: QA(context, What's the country …); execution e2: LLM(...) ⇒ Japan. Web task example (Mind2Web): "Find flights from honolulu to NYC with budget of $1,300 for premium economy." Subgoal s1: "Search flights from honolulu to nyc..."; actions a1-1: Type([box-id], HNL), a1-2: Type([box-id], JFK), a1-3: Click([button-id]), each executed in the browser. Subgoal s2: "Set a filter to keep … price ≤ 1300". Subgoal s3: "Set a filter to keep premium economy".]

web, math, and multimodal tasks. We summarize our contributions and results as follows:

General Agent Framework with High-Quality Data. We introduce an open-source agent learning framework that trains LLMs with unified data, aimed at unifying complex interactive tasks and enhancing generalization on unseen tasks with new environments and actions. We hope our framework and annotations can facilitate future research in developing open-source language agents.

Competitive Performance. LUMOS outperforms many open-source agents on the held-out datasets, unused in LUMOS training data, across the four training task types. LUMOS even surpasses GPT-based agents in web and QA tasks.
Specifically, LUMOS shows a 5.0% enhancement over GPT-4 on Mind2Web, and 4.1% and 3.5% LLM accuracy¹ improvement on HotpotQA over the ReAct and ReWOO agents fully based on GPT-3.5-turbo, respectively.

Cross-Task Generalization. We evaluate LUMOS on two unseen tasks, WebShop (Yao et al., 2022a), a text game for online shopping, and InterCode SQL (Yang et al., 2023), an interactive code generation task. LUMOS even surpasses 30B-scale agents, especially by nearly 20 reward points on WebShop. LUMOS also delivers a consistent reward improvement over domain-specific agents. This suggests that LUMOS can generalize across tasks, hinting at potential benefits for a wide spectrum of language agent applications.

2 LUMOS: A Modular Open-Source LLM-Based Agent Framework

We introduce the overall design and two formulations for developing agents within this framework.

2.1 LUMOS Agent Architecture

For various complex interactive tasks, a common solution would include: (1) decomposing the