NeurIPS2023

Demo2Code: From Summarizing Demonstrations to Synthesizing Code via Extended Chain-of-Thought

Yuki Wang, Gonzalo Gonzalez-Pumariega, Yash Sharma, Sanjiban Choudhury

被引用 54 次

摘要

Language instructions and demonstrations are two natural ways for users to teach robots personalized tasks. Recent progress in Large Language Models (LLMs) has shown impressive performance in translating language instructions into code for robotic tasks. However, translating demonstrations into task code continues to be a challenge due to the length and complexity of both demonstrations and code, making learning a direct mapping intractable. This paper presents Demo2Code, a novel framework that generates robot task code from demonstrations via an extended chain-of-thought and defines a common latent specification to connect the two. Our framework employs a robust two-stage process: (1) a recursive summarization technique that condenses demonstrations into concise specifications, and (2) a code synthesis approach that expands each function recursively from the generated specifications. We conduct extensive evaluation on various robot task benchmarks, including a novel game benchmark Robotouille, designed to simulate diverse cooking tasks in a kitchen environment. The project's website is at https://portal-cornell.github.io/demo2code/ 1 Introduction How do we program home robots to perform a wide variety of personalized everyday tasks? Robots must learn such tasks online, through natural interactions with the end user. A user typically communicates a task through a combination of language instructions and demonstrations. This paper addresses the problem of learning robot task code from those two inputs. For instance, in Fig. 1 , the user teaches the robot how they prefer to make a burger through both language instructions, such as "make a burger", and demonstrations, which shows the order in which the ingredients are used. Recent works [24, 23, 33, 80, 61, 35] have shown that Large Language Models (LLMs) are highly effective in using language instructions as prompts to plan robot tasks. However, extending LLMs to take demonstrations as input presents two fundamental challenges. The first challenge comes from demonstrations for long-horizon tasks. Naively concatenating and including all demonstrations in the LLM's prompt would easily exhaust the model's context length. The second challenge is that code for long-horizon robot tasks can be complex and require control flow. It also needs to check for physics constraints that a robot may have and be able to call custom perception and action libraries. Directly generating such code in a single step is error-prone. Our key insight is that while demonstrations are long and code is complex, they both share a latent task specification that the user had in mind. This task specification is a detailed language 37th Conference on Neural Information Processing Systems (NeurIPS 2023).