AAAI 2025

Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor

Qi Zhao, Christian Wressnegger

Abstract

The community has recently developed various training-time defenses to counter neural backdoors introduced through data poisoning. Building on the observation that a model learns the poisonous samples responsible for the backdoor more easily than benign samples, these approaches either split the training data using a fixed threshold on the training loss (Li et al. 2021a; Huang et al. 2022; Chen et al. 2022) or iteratively learn a reference model as an oracle for identifying benign samples (Gao et al. 2023; Zhang et al. 2023). In particular, the latter has proven effective for anti-backdoor learning. Our method, HARVEY, leverages a similar yet crucially different technique: learning an oracle for poisonous rather than benign samples. Learning a backdoored reference model is significantly easier than learning a reference model on benign data. Consequently, we can identify poisonous samples much more accurately than related work identifies benign samples. This crucial difference enables near-perfect backdoor removal, as we demonstrate in our evaluation. HARVEY substantially outperforms related approaches across attack types, datasets, and architectures, lowering the attack success rate to a minimum at a negligible loss in natural accuracy. Supplementary material is available at https://intellisec.de/research/harvey
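
To make the fixed-threshold loss split attributed to earlier defenses above concrete, the following is a minimal sketch, not HARVEY's procedure and not the exact rule of any cited work. It assumes per-sample training losses are already available and flags low-loss samples as suspected poisonous, following the observation that poisonous samples are learned more easily. The function name `split_by_loss`, the threshold value, and the loss values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def split_by_loss(
    losses: Sequence[float], threshold: float
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (suspected_poisonous, presumed_benign).

    Assumption for illustration: a sample whose training loss is at or
    below `threshold` was learned unusually quickly and is therefore
    flagged as suspected poisonous; all other samples are presumed benign.
    """
    suspected: List[int] = []
    benign: List[int] = []
    for idx, loss in enumerate(losses):
        if loss <= threshold:
            suspected.append(idx)
        else:
            benign.append(idx)
    return suspected, benign


# Made-up per-sample losses after a few training epochs (illustrative only).
example_losses = [0.02, 1.35, 0.90, 0.01, 2.10, 0.75]
suspected, benign = split_by_loss(example_losses, threshold=0.05)
```

In this toy input, the two samples with near-zero loss end up in `suspected` and the rest in `benign`. A single fixed threshold of this kind is sensitive to the dataset and training schedule, which is one reason the reference-model approaches described above replace it with a learned oracle.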