ICLR 2026

Learning From Dictionary: Enhancing Robustness of Machine-Generated Text Detection in Zero-Shot Language via Adversarial Training

Yuanfan Li, Qi Zhou, Zexuan Xie

Abstract

Machine-generated text (MGT) detection is critical for safeguarding online content integrity and preventing the spread of misleading information. Although existing detectors achieve high accuracy in monolingual settings, they degrade severely on zero-shot languages and are vulnerable to adversarial attacks. To tackle these challenges, we propose a robust adversarial training framework named Translation-based Attacker Strengthens MulTilingual DefEnder (TASTE). TASTE comprises two core components: an attacker that performs code-switching by querying translation dictionaries to generate adversarial examples, and a detector trained to resist these attacks while generalizing to unseen languages. We further introduce a novel Language-Agnostic Adversarial Loss (LAAL), which encourages the detector to learn language-invariant feature representations and thus improves both zero-shot detection performance and robustness against unseen attacks. Additionally, the attacker and detector are updated synchronously, which enables continuous improvement of the detector's defensive capabilities. Experimental results on 9 languages and 8 attack types show that TASTE surpasses 8 SOTA detectors, improving the average F1 score by 0.064 and reducing the average Attack Success Rate (ASR) by 3.8%. Our framework offers a promising approach for building robust, multilingual MGT detectors with strong generalization to real-world adversarial scenarios. Our code is available at https://github.com/Liyuuuu111/MGT-Eval, and our datasets and pretrained checkpoint are available at https://drive.google.com/file/d/1w1hbdiZMS_JzPntVMWM3qrTQ4KxJf-t6.

REPRODUCIBILITY STATEMENT

Code and artifacts. We commit to releasing: (1) training/evaluation code; (2) the datasets used in this paper; (3) trained detector checkpoints; and (4) the translation-dictionary resources we used, or scripts to construct them from public sources (with licenses).

Data and splits. We describe all datasets, languages, and splits used for training/validation/test in Appendix A and provide scripts to regenerate them deterministically from public releases.

Training details. We enumerate all hyperparameters (optimizer, learning rates for detector/surrogate, batch sizes, epochs, fp16, gradient clipping), schedules (e.g., GRL weight schedule), and the attack-strength curriculum (initial tokens, maximum ratio, increment per step) in Appendix A.
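
To make the dictionary-based code-switching attack and the attack-strength curriculum (initial tokens, maximum ratio, increment per step) more concrete, the following is a minimal illustrative sketch, not the released TASTE implementation. It rests on several assumptions: the toy dictionary `TOY_DICT`, the function names `attack_budget` and `code_switch`, the linear schedule, and all default values are hypothetical, and uniform random token selection stands in for the learned attacker described in the abstract.

```python
import random

# Hypothetical toy dictionary (English -> {language code: translation}).
# The translation-dictionary resources used in the paper are not shown here.
TOY_DICT = {
    "machine": {"de": "Maschine", "it": "macchina"},
    "text": {"de": "Text", "it": "testo"},
    "language": {"de": "Sprache", "it": "lingua"},
}


def attack_budget(step, n_tokens, initial_tokens=1, increment=1, max_ratio=0.25):
    """Number of tokens the attacker may replace at a given training step.

    Assumed linear curriculum: start at `initial_tokens`, add `increment`
    per step, and cap the budget at `max_ratio` of the sequence length.
    All default values are placeholders, not the paper's hyperparameters.
    """
    cap = int(max_ratio * n_tokens)
    return min(initial_tokens + increment * step, cap)


def code_switch(tokens, dictionary, budget, rng):
    """Replace up to `budget` tokens with a translation from `dictionary`.

    Positions and target languages are sampled uniformly at random here;
    this is a stand-in for a learned attacker, used only for illustration.
    """
    # Only tokens with a dictionary entry can be code-switched.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in dictionary]
    chosen = rng.sample(candidates, min(budget, len(candidates)))
    out = list(tokens)
    for i in chosen:
        translations = dictionary[tokens[i].lower()]
        lang = rng.choice(sorted(translations))
        out[i] = translations[lang]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "Machine generated text in any language".split()
    for step in (0, 1, 5):
        budget = attack_budget(step, n_tokens=8)
        print(step, budget, " ".join(code_switch(tokens, TOY_DICT, budget, rng)))
```

Under these assumptions the adversarial example keeps the original token count and changes only dictionary-covered words, so the amount of code-switching grows with the training step until it reaches the ratio cap.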