ICLR 2026

Inference-Time Personalized Safety Control via Paired Difference-in-Means Intervention

Tran Huynh, Ruoxi Jia

Abstract

Safety preferences are inherently subjective, yet current LLM safety alignment methods often impose universal standards that fail to account for individual sensitivities. In this work, we propose an efficient, training-free method for personalized safety control via inference-time activation intervention. Our approach steers internal representations to suppress user-specific undesired content while preserving model utility. We systematically evaluate three strategies for estimating intervention directions: Instance-Level Contrast Shift (ILCS), Unpaired Mean Shift (UMS), and our primary method, Paired Contrast Mean Shift (PCMS). We provide theoretical insights into each approach and highlight the advantages of PCMS. Empirical results across diverse open-weight models demonstrate that our method effectively reduces undesired content in line with individual preferences, with minimal impact on helpfulness, enabling more adaptive and user-aligned LLM behavior.