ACL 2024

Rethinking Task-Oriented Dialogue Systems: From Complex Modularity to Zero-Shot Autonomous Agent

Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, Heyan Huang

Abstract

Task-oriented dialogue (TOD) systems are predominantly designed as a composition of several functional modules (e.g. dialogue state tracker, dialogue policy, natural language generation), whether they follow a pipeline or an end-to-end architecture. However, this modular design not only relies heavily on massive fully-annotated data, but also suffers from many intrinsic drawbacks, such as serious error accumulation, poor generalization ability, high customization cost, and low fault tolerance. In this paper, we rethink the architecture of task-oriented dialogue systems and propose a novel fully zero-shot autonomous TOD agent, named AutoTOD, in which all the delicate modules of traditional TOD systems are deprecated and all that is needed is a general-purpose instruction-following language model (e.g. GPT-4). AutoTOD leverages only a simple instruction schema consisting of descriptions of the tasks and external APIs, and can autonomously decide what to do at each dialogue turn, including asking for information, calling APIs, summarizing API results, and correcting previous mistakes. Moreover, we propose a simulation-based evaluation framework to better validate the abilities of TOD models in real-life scenarios. Extensive experiments conducted on the MultiWOZ and SGD datasets show the superior task completion ability and flexible language skills of AutoTOD.

[...] 2020). Traditional TOD systems are mostly designed as a pipeline of several separate modules, including natural language understanding, dialogue state tracking, dialogue policy, and natural language generation (Zhang et al., 2020). These modules are trained separately and work one by one to generate the dialogue response to the user (Su et al., 2022). Later, end-to-end TOD systems emerged where the separate modules are combined and built on a single pretrained language model (He et al., 2022a; Yang et al., 2021).
Thus the whole system can be trained end-to-end with annotated task dialogues. Examples of these two kinds of TOD systems are shown in Figure 1 (a, b). Nevertheless, both the pipeline and end-to-end models are essentially in the same modular architecture.

[...] 2019). The results show the superior task completion ability and fluent language skills of AutoTOD. Furthermore, AutoTOD demonstrates great robustness when facing various dialogue scenarios.

2 Related Work

2.1 Task-Oriented Dialogue Systems

Task-oriented dialogue (TOD) systems have been studied for decades. Traditional approaches are fundamentally built in a pipeline architecture, consisting of components including natural language understanding, dialogue state tracking, dialogue policy learning, and natural language generation (Wu