ICLR 2026

Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning

Xingyi Zhao, Tian Xie, Xiaojun Qi, Depeng Xu, Shuhan Yuan

Abstract

Backdoor attacks pose a significant threat to machine learning models, allowing adversaries to implant hidden triggers that alter model behavior when activated. Although gradient ascent (GA)-based unlearning has been proposed as an efficient backdoor removal approach, we identify a critical yet overlooked issue: GA does not eliminate the trigger but shifts its impact to different classes, a phenomenon we call trigger shifting. To address this, we propose Robust Gradient Ascent (RGA), which introduces a dynamic penalty mechanism to regulate GA strength and prevent excessive unlearning. Our experiments show that RGA effectively removes backdoors while preserving model utility, offering a more reliable GA-based defense against backdoor attacks. The code is available at https://github.com/xingyizhao/RGA.

To the best of our knowledge, this risk of trigger shifting has not been previously explored. This is because current evaluation metrics, such as accuracy on clean samples (measuring utility) and label flipping ratio (measuring the flipping rate of the poisoned class, e.g., "bb" on negative samples), fail to account for trigger shifting. Consequently, these metrics underestimate the unintended effects of over-unlearning caused by gradient ascent.

In this work, we theoretically analyze the cause of trigger shifting when applying vanilla GA for backdoor unlearning. To address this challenge, we propose Robust Gradient Ascent (RGA), a novel framework that enhances the stability and reliability of GA-based backdoor unlearning. Rather than allowing the gradient to increase indefinitely, RGA incorporates a dynamic penalty mechanism that adaptively regulates the strength of GA during backdoor removal. Our experiments demonstrate that RGA not only preserves model utility and effectively eliminates various backdoor effects but, most importantly, prevents trigger shifting.

Related Work

Backdoor Attack.
Most textual backdoor attack research focuses on engineering backdoor triggers and poisoning the training data. The triggers can be classified into three types: (1) Word-level: Triggers can be crafted using various word-level strategies, including misspelled words (