ACL2024
Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks
Yichen Wang, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, Tianxing He
Abstract
The widespread use of large language models (LLMs) is increasing the demand for methods that detect machine-generated text to prevent misuse. The goal of our study is to stress test the detectors' robustness to malicious attacks under realistic scenarios. We comprehensively study the robustness of popular machine-generated text detectors under attacks from diverse categories: editing, paraphrasing, prompting, and co-generating. Our attacks assume limited access to the generator LLMs, and we compare the performance of detectors on different attacks under different budget levels. Our experiments reveal that almost none of the existing detectors remain robust under all the attacks, and all detectors exhibit different loopholes. Averaging all detectors, the performance drops by 35% across all attacks.

Figure 1: Pipeline of the study. The attacks are carried out on the machine-generated texts before, during, or after generation. Each attack is applied with different perturbation levels, denoted as budgets (§4).

… topic mostly focus on the robustness of specific detectors or particular attack methods. For example, Liu et al. (2022) specifically evaluate the token editing attack for model-based detectors, and Zhang et al. (2023) assay the topic-shifting attack for metric-based detectors, etc. To the best of our knowledge, in the literature, there is no thorough comparative evaluation of robustness of machine-generated text detectors against malicious attacks, covering a wide range of detectors and attacks.

With this goal, we study the robustness of 8 prevalent MGT detectors from 3 categories under 12 realistic attacks (§6, Table 1), including editing, paraphrasing, prompting, co-generating, etc. The majority of the attacks in this paper are proposed or attempted for the first time.
For a fair comparison across detectors and attacks, we utilize a series of metrics to measure the perturbation level of each attack, which we term "budget" (§4). Strikingly, our experiments (§6.1) reveal that almost none of the existing detectors remains robust under all the attacks, showing a variety of potential weaknesses or loopholes. For example, typo insertion that edits only about 2 to 6 characters can severely deceive metric-based detectors, such as DetectGPT (Mitchell et al., 2023), making them perform worse than random prediction (§6.2). Hence, we view the attacks as the stumbling blocks for current MGT detectors toward robustness. Moreover, we interpret the reasons behind the detectors' weaknesses under attacks, and we further introduce out-of-the-box patches with inferior performance in some scenarios (further defense discussed in Appendix E.1).

We build a robustness leaderboard (Table 2; the pipeline is illustrated in Figure 1) by averaging results from different attacks.

Attack Category | Method | Model-Free? | Level | Access | Detailed Descriptions
Editing (§6.2), post-generation | Typo Insertion | ✓ | Character | None | Create typos, mainly by inserting, deleting, substituting, and transposing.
Editing (§6.2), post-generation | Homoglyph Alteration | ✓ | Character | None | Change English characters into visually similar Unicode characters, e.g., Cyrillic characters.
Editing (§6.2), post-generation | Format Character Editing | ✓ | Character | None | Change or insert formatting characters, including zero-width whitespace (200B) insertion and shift character editing, e.g., …, 000B (vertical tab), etc.
Paraphrasing | … | … | … | … | …

Table 1: Overview of the attacks. 'Model-Free' indicates whether the attacker is free from using any additional language model. 'Access' indicates the access to the generator needed to carry out the attack (details in §6 and examples in Table 15).
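To make the character-level editing attacks listed in Table 1 concrete, the following is a minimal Python sketch of typo insertion, homoglyph alteration, and zero-width whitespace insertion. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the small Latin-to-Cyrillic map, and the choice to express the budget as a number of character edits are all hypothetical.

```python
import random

# Hypothetical Latin -> Cyrillic look-alike map (small illustrative subset).
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}
ZERO_WIDTH_SPACE = "\u200b"
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def insert_typos(text, budget, rng):
    """Apply up to `budget` random edits: insert, delete, substitute, transpose."""
    chars = list(text)
    for _ in range(budget):
        if len(chars) < 2:
            break  # too short to edit safely
        op = rng.choice(["insert", "delete", "substitute", "transpose"])
        i = rng.randrange(len(chars) - 1)
        if op == "insert":
            chars.insert(i, rng.choice(LETTERS))
        elif op == "delete":
            del chars[i]
        elif op == "substitute":
            # Pick a letter different from the current character.
            chars[i] = rng.choice([c for c in LETTERS if c != chars[i]])
        else:
            # Swap two adjacent characters.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def alter_homoglyphs(text, budget, rng):
    """Replace up to `budget` Latin characters with Cyrillic look-alikes."""
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c in HOMOGLYPHS]
    for i in rng.sample(candidates, min(budget, len(candidates))):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


def insert_zero_width(text, budget, rng):
    """Insert `budget` zero-width spaces (U+200B) at random positions."""
    chars = list(text)
    for _ in range(budget):
        chars.insert(rng.randrange(len(chars) + 1), ZERO_WIDTH_SPACE)
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sample = "The quick brown fox jumps over the lazy dog."
    for attack in (insert_typos, alter_homoglyphs, insert_zero_width):
        # repr() makes invisible and look-alike characters visible as escapes.
        print(attack.__name__, repr(attack(sample, 3, rng)))
```

All three functions need no language model and no access to the generator, which matches the "Model-Free" and "Access: None" columns of Table 1. The homoglyph and zero-width edits leave the rendered text visually near-identical, so the perturbation shows up only in the underlying code points.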
We find that watermarking (Kirchenbauer et al., 2023a) performs best for robust MGT detection against the attacks applicable to it. Next, model-based detectors are more robust than metric-based ones in most cases. Overall, this study aims to raise awareness of the detection vulnerabilities and the urgency of more robust methodologies, thereby turning the stumbling blocks into stepping stones.

2 Problem Formulation

Threat Model. Figure 1 shows the overall pipeline. There are three roles in the problem: generator (§3), detector (§3), and attacker (Table 1, §6). The task for the detector is to classify whether a given piece of text is human-written (HWT) or machine-generated (MGT) from the generator LM. In the attacked scenario, before the MGT is sent to the detector, an attacker could tamper with the text or …

… Davinci-003 (OpenAI, 2022b) and GPT-4 (OpenAI, 2023) as the closed-source generator representatives. All the generators show similar results under attacks (Appendix G.3). We select GPT-J (6B) as the default generator to show the results in §6 if unspecified (we empirically find stronger generative LMs are