ICLR 2025

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme

Abstract

Example Configs - Game Development Firm (excerpt)

We are a game development firm specializing in a broad range of games ... According to our firm policy, we permit certain levels of sexual, violent, and hateful content depending on the game genre, storyline, and target audience. Nevertheless, all content must comply with the following guidelines:

- We allow violent content that includes slurs, cursing, threats, or graphic scenes of fights or wars. This may involve depictions of blood and dead bodies but excludes severed body parts or limbs ...