ICML 2025

Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, Ruoxi Jia

Abstract

Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions, but it often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To address this, we introduce ACTOR (ACtivation-Based Training for Over-Refusal Reduction), a robust, compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserving overall utility. Warning: This paper contains model outputs that can be harmful in nature. Code available here.