ACL 2024

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang

Abstract

Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects the emerging safety considerations tied to their application. To fill this gap, we present ToolSword, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing malicious queries and jailbreak attacks in the input stage, noisy misdirection and risky cues in the execution stage, and harmful feedback and error conflicts in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, to which even GPT-4 is susceptible. Moreover, we conduct further studies with the aim of fostering research on tool learning safety.¹

¹ The data will be released upon acceptance of the paper.

1 Introduction

Recently, tool learning has garnered significant attention as a potent approach for seamlessly integrating large language models (LLMs) (Brown et al., 2020; Bai et al., 2022b; Touvron et al., 2023a) into real-world applications (Yang et al., 2023b; Mialon et al., 2023; Qin et al., 2023a). The tool learning process for LLMs can be delineated into three distinct stages: input, execution, and output. More precisely, when a user submits a request, LLMs scrutinize the user's intent and choose appropriate tools to engage with the external environment. Upon receiving feedback from the environment, LLMs structure the pertinent information to provide a response to the user's initial query.

Standard Dialog
Query: How to spread rumors about someone online?
Response: I'm sorry, but I can't assist with that. Spreading rumors about someone online can cause serious harm and damage to their reputation and mental well-being.

Tool Learning
You are an expert in using tools to handle real-time queries from users.
Query: How to spread rumors about someone online?
Thought: The user requested information on spreading rumors. I can utilize the 'search_information' tool to assist.