EMNLP 2025

SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

João Fonseca, Andrew Bell, Julia Stoyanovich

Abstract

Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, i.e., adversarial attacks used to elicit high-risk behavior from a model, highlighting the critical need to safeguard widely deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect," may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we make three contributions: (1) We introduce SAFENUDGE, a novel safeguard that combines Controlled Text Generation and "nudging." SAFENUDGE triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by between 28.1% and 37.3% by guiding the LLM towards a safe response. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. (2) SAFENUDGE supports tunable SPTs, meaning practitioners can set their own tolerance for trade-offs between safety and restrictions to normal model behavior. (3) We release the source code for SAFENUDGE at https://github.com/joaopfonseca/SafeNudge. It is open source and compatible with the Hugging Face transformers library.
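To make the idea of a safeguard that triggers during generation with a tunable threshold more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not use the SafeNudge or transformers APIs. The names `generate_with_nudge`, `make_toy_model`, `toy_risk`, `NUDGE_TOKEN`, and the threshold `tau` are all hypothetical, and the toy "model" and risk scorer are stand-ins for a real LLM and a real safety classifier. The sketch also assumes the nudge is added to the model's context but hidden from the visible output, which is a design choice made here for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical marker standing in for a hidden "nudge" text placed in the context.
NUDGE_TOKEN = "[NUDGE]"


def generate_with_nudge(
    next_token: Callable[[List[str]], Optional[str]],
    risk_score: Callable[[List[str]], float],
    prompt: List[str],
    tau: float = 0.5,
    max_new_tokens: int = 20,
) -> Tuple[List[str], bool]:
    """Generate token by token; nudge once if the partial output looks risky.

    tau is the assumed tunable threshold: lower values trigger the nudge
    earlier (safer, more restrictive), higher values trigger it later or never.
    """
    context = list(prompt)
    output: List[str] = []
    nudged = False
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token is None:  # toy end-of-sequence signal
            break
        if not nudged and risk_score(output + [token]) >= tau:
            # Drop the risky candidate token and steer the model instead.
            context.append(NUDGE_TOKEN)
            nudged = True
            continue
        context.append(token)
        output.append(token)
    return output, nudged


def make_toy_model(prompt_len: int) -> Callable[[List[str]], Optional[str]]:
    """Toy stand-in for an LLM: follows an unsafe script unless nudged."""
    unsafe_script = ["Sure", ",", "step", "one", ":", "mix"]
    safe_script = ["I", "cannot", "help", "with", "that", "."]

    def next_token(context: List[str]) -> Optional[str]:
        if NUDGE_TOKEN in context:
            pos = len(context) - context.index(NUDGE_TOKEN) - 1
            return safe_script[pos] if pos < len(safe_script) else None
        pos = len(context) - prompt_len
        return unsafe_script[pos] if pos < len(unsafe_script) else None

    return next_token


def toy_risk(tokens: List[str]) -> float:
    """Toy stand-in for a safety classifier: risk in [0, 1] from flagged words."""
    flagged = {"step", "mix"}
    return min(1.0, sum(t in flagged for t in tokens) / 2)


if __name__ == "__main__":
    model = make_toy_model(prompt_len=2)
    for tau in (0.5, 1.0, 1.5):
        out, nudged = generate_with_nudge(model, toy_risk, ["tell", "me"], tau=tau)
        print(tau, nudged, " ".join(out))
```

In this toy setup, varying `tau` changes how much of the unsafe script is emitted before the nudge fires, which mirrors in a very simplified way the kind of safety-performance trade-off the abstract describes.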