CVPR 2025
Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration
Lizheng Zu, Lin Lin, Song Fu, Na Zhao, Pan Zhou
Abstract
Embodied agents based on large language models (LLMs) face significant challenges in collaborative tasks, which require effective communication and a reasonable division of labor to ensure efficient and correct task completion. Previous approaches with simple communication patterns can produce erroneous or incoherent agent actions, which can lead to additional risks. To address these problems, we propose Cooperative Tree Search (CoTS), a framework designed to significantly improve collaborative planning and task execution efficiency among embodied agents. CoTS guides multiple agents to discuss long-term strategic plans within a modified Monte Carlo tree, searching along LLM-driven reward functions to provide a more thoughtful and promising approach to cooperation. Another key feature of our method is a plan evaluation module, which prevents the agent action confusion caused by frequent plan updates while still ensuring that the plan is updated when the current one becomes unsuitable. Experimental results show that the proposed method performs strongly in planning, communication, and collaboration in embodied environments (CWAH and TDW-MAT), efficiently completing long-term, complex tasks and significantly outperforming existing methods.

* Corresponding author.

ability, especially in dynamic scenarios [5, 10]. However, collaboration among embodied agents introduces significant challenges: agents must not only perceive and understand their environment but also communicate, share information, divide tasks, and coordinate actions responsively. For example, consider autonomous robots in a home responding to a request: "Bring me the iPad and apple, and put the milk in the refrigerator." Achieving this requires multi-level coordination, including optimal search strategies, task prioritization, and efficient movement planning.
Large language models (LLMs) have recently provided embodied agents with advanced natural language understanding, dialogue, and reasoning abilities [1, 3, 38, 40]. These capabilities allow LLMs to decompose complex, long-term tasks into a sequence of manageable sub-goals, making LLM-driven agents a promising alternative to traditional reinforcement learning models [14, 34], which are difficult to train and often generalize poorly [7, 21, 39]. However, enabling embodied agents to work together in decentralized environments remains a significant and underexplored challenge, as it requires long-term planning and coherent decision-making to coordinate actions efficiently.

Early attempts at multi-agent collaboration, such as CoELA [43] and RoCo [24], demonstrate progress but also reveal limitations. For instance, as shown in Fig. 1(b), CoELA facilitates collaboration by sharing updates through natural language when sub-tasks are completed, yet each agent's decision-making process remains independent, resulting in suboptimal coordination. In contrast, as illustrated in Fig. 1(c), RoCo develops multi-agent work plans through agent discussions and environmental interactions. However, RoCo relies on a single reasoning path, which is susceptible to the randomness and unpredictability of LLM outputs. This can lead to inefficient or incorrect planning, particularly in applications requiring precision, where such errors can have serious consequences.

Contributions. To address these limitations, we propose Cooperative Tree Search (CoTS), a framework designed to significantly improve collaborative planning and task execution efficiency among embodied agents.

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
[Figure: framework overview. Recoverable labels: Embodied Environments (Act., Obs.); Perception Module (object state, distance, other agents, obstacle, …); Collaborative Planning Module (Collaborative Tree Search; example agent output "But I think we …", Reward: 0.3); Plan Parsing and Execution Module (action decomposition, action execution; example plan "Alice grasps xx, Bob go xxx."); Memory Module (semantic map, task progress, current plan, dialogue history, action history, agent state; update and retrieve).]
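To make the idea of searching over several candidate plans, rather than following a single reasoning path, more concrete, the following is a minimal sketch of a generic Monte Carlo tree search over joint plans. It is not the CoTS algorithm itself: the source only states that CoTS uses a modified Monte Carlo tree with LLM-driven rewards, so everything here is an assumption made for illustration. In particular, `propose` and `toy_reward` are hypothetical stand-ins for the LLM-based plan discussion and reward scoring, and the workload-balance reward, the UCB constant, and the task and agent names are invented for the example.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = Tuple[str, ...]  # ordered assignments such as ("Alice->ipad", "Bob->apple")


@dataclass
class Node:
    plan: Plan
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    @property
    def mean(self) -> float:
        return self.total_reward / self.visits if self.visits else 0.0

    def ucb(self, c: float = 1.4) -> float:
        # Standard UCB1 score; unvisited nodes are always tried first.
        if self.visits == 0:
            return math.inf
        return self.mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def search(propose: Callable[[Plan], List[str]],
           reward: Callable[[Plan], float],
           max_depth: int, iterations: int = 200, seed: int = 0) -> Plan:
    rng = random.Random(seed)
    root = Node(plan=())
    for _ in range(iterations):
        node = root
        # Selection: descend by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # Expansion: add one child per proposed next step, then pick one.
        if len(node.plan) < max_depth:
            node.children = [Node(node.plan + (s,), parent=node)
                             for s in propose(node.plan)]
            if node.children:
                node = rng.choice(node.children)
        # Evaluation: score the (possibly partial) plan with the reward stub.
        r = reward(node.plan)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Read out the plan along the most-visited path (ties broken by mean reward).
    node = root
    while node.children:
        node = max(node.children, key=lambda n: (n.visits, n.mean))
    return node.plan


# --- Hypothetical toy problem (assumption, not from the paper) ---
TASKS = ["ipad", "apple", "milk"]
AGENTS = ["Alice", "Bob"]


def propose(plan: Plan) -> List[str]:
    # Stand-in for agents proposing who handles the next unassigned task.
    if len(plan) >= len(TASKS):
        return []
    return [f"{agent}->{TASKS[len(plan)]}" for agent in AGENTS]


def toy_reward(plan: Plan) -> float:
    # Stand-in for an LLM-driven reward: prefers a balanced division of labor.
    a = sum(step.startswith("Alice") for step in plan)
    b = len(plan) - a
    return 1.0 - abs(a - b) / max(1, len(plan))


best = search(propose, toy_reward, max_depth=len(TASKS))
print(best, toy_reward(best))
```

In this toy setting the search explores all assignments of the three household sub-tasks and settles on one that splits the work between the two agents. Replacing `toy_reward` with a model-based scorer, and `propose` with plans generated through agent dialogue, would be the natural next step, though how CoTS actually does this is not specified in the text above.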