ACL 2024

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh

Cited by 5

Abstract

Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.

1 Introduction

The recent proliferation of Large Language Models (LLMs) has drawn attention to several practical issues, such as model leaks, malicious usages and potential breaches of model licences. The phenomenon of model leaks recently captured public attention, particularly following an incident at the end of January 2024, when an anonymous user uploaded an unidentified LLM to HuggingFace.¹ The CEO of Mistral subsequently confirmed that this was an internal model, leaked by an employee of an early access customer.² This event underscores the growing threat of internal breaches that LLM providers must contend with. LLM providers are also facing malicious usage of their technologies.