CCS2023

MDTD: A Multi-Domain Trojan Detector for Deep Neural Networks

Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James A. Ritcey, Radha Poovendran

Abstract

Machine learning models that use deep neural networks (DNNs) are vulnerable to backdoor attacks. An adversary carrying out a backdoor attack embeds a predefined perturbation called a trigger into a small subset of input samples and trains the DNN such that the presence of the trigger in the input results in an adversary-desired output class. Such adversarial retraining, however, must ensure that outputs for inputs without the trigger remain unaffected, so that the model still provides high classification accuracy on clean samples. Existing defenses against backdoor attacks are computationally expensive, and their success has been demonstrated primarily on image-based inputs. The increasing popularity of deploying pretrained DNNs to reduce the cost of training or retraining large models makes defense mechanisms that aim to detect 'suspicious' input samples preferable. In this paper, we propose MDTD, a Multi-Domain Trojan Detector for DNNs, which detects inputs containing a Trojan trigger at testing time. MDTD does not require knowledge of the attacker's trigger-embedding strategy and can be applied to a pretrained DNN model with image, audio, or graph-based inputs. MDTD leverages the insight that input samples containing a Trojan trigger are located relatively farther away from a decision boundary than clean samples. MDTD estimates the distance to a decision boundary using adversarial learning methods and uses this distance to infer whether a test-time input sample is Trojaned or not. We evaluate MDTD against state-of-the-art Trojan detection methods across five widely used image-based datasets-CIFAR100,