CCS2025

Cascading Adversarial Bias from Injection to Distillation in Language Models

Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea

Abstract

Model distillation has become essential for creating deployable language models, but the widespread deployment of these models raises concerns about their resilience to adversarial manipulation. This paper investigates how adversaries can inject subtle biases into teacher models through minimal data poisoning during training, and how these biases propagate to a smaller distilled student model, where they are significantly amplified. We identify two propagation modes: Untargeted (affecting multiple tasks) and Targeted (focusing on a specific task while maintaining normal behavior elsewhere). With only 25 poisoned samples (a 0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios, versus 69.4% for teachers, while untargeted propagation yields a 5.7x to 29.2x higher adversarial bias rate in students on unseen tasks. We validate these findings across six bias types (including targeted advertisements, phishing links, narrative manipulations, and insecure coding practices), various distillation methods, and both text and code generation modalities. Current defense mechanisms, including perplexity filtering, bias detection systems, and LLM-based autoraters, prove inadequate against these attacks. We propose practical design principles for building effective adversarial bias mitigation strategies to address this threat vector.