AAAI 2025

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Somnath Banerjee, Sayan Layek, Soham Tripathy, Shanu Kumar, Animesh Mukherjee, Rima Hazra

16 citations

Abstract

Warning: This paper contains several unethical and sensitive statements. Language models aligned for safety often exhibit fragile and imbalanced safety mechanisms, increasing the chances of producing unsafe content. In addition, editing techniques used to incorporate new knowledge can further compromise safety. To tackle these issues, we propose SAFEINFER, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SAFEINFER involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HARMEVAL, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies. We release the source code and dataset at https://github.com/NeuralSentinel/SafeInfer.
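The two phases described above can be illustrated with a toy sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration only: the function names, the averaging of demonstration hidden states into a "safety vector", the scaling factor `alpha`, and the log-linear blend controlled by `lam` are hypothetical and are not taken from the released SafeInfer code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def amplify_hidden_state(hidden, safety_vector, alpha=1.0):
    """Phase 1 (assumed form): shift a hidden state along a safety direction
    derived from safe demonstration examples."""
    return [h + alpha * s for h, s in zip(hidden, safety_vector)]

def guided_next_token(base_logits, safety_logits, lam=0.5):
    """Phase 2 (assumed form): blend the base distribution with a
    safety-optimized distribution in log space, then pick the top token."""
    base_lp = [math.log(p) for p in softmax(base_logits)]
    safe_lp = [math.log(p) for p in softmax(safety_logits)]
    scores = [(1 - lam) * b + lam * s for b, s in zip(base_lp, safe_lp)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy data: hidden states recorded on two safe demonstrations (hypothetical values).
demo_states = [[1.0, 0.0], [3.0, 2.0]]
safety_vector = mean_vector(demo_states)
steered = amplify_hidden_state([0.5, 0.5], safety_vector, alpha=0.5)

# Toy two-token vocabulary: index 0 = "comply", index 1 = "refuse".
base_logits = [2.0, 1.0]      # the unguided model prefers "comply"
safety_logits = [0.0, 3.0]    # the safety-optimized distribution prefers "refuse"
unguided, _ = guided_next_token(base_logits, safety_logits, lam=0.0)
guided, guided_probs = guided_next_token(base_logits, safety_logits, lam=0.5)
```

In this toy setting, `lam=0.0` reproduces the base model's choice, while a larger `lam` lets the safety-optimized distribution dominate token selection. A real system would apply the hidden-state shift inside the transformer layers and recompute both distributions at every decoding step.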