ACL 2025
The Structural Safety Generalization Problem
Julius Broomfield, Tom Gibbs, George Ingebretsen, Ethan Kosak-Hine, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
Abstract
It is widely known that AI is vulnerable to adversarial examples, from pixel perturbations to jailbreaks. We propose that there is a key subclass of problems that also remains unsolved: failures of safety to generalize over structure, despite semantic equivalence. We demonstrate this by exposing new vulnerabilities to multi-turn and multi-image attacks that yield different outcomes from their single-turn and single-image counterparts of equivalent meaning. We suggest this is the same class of vulnerability as that found in as-yet unconnected threads of the literature, such as vulnerabilities to low-resource languages and the vulnerability of strongly superhuman Go AIs to cyclic attacks. In contrast to attacks with identical benign input (e.g., pictures that look like cats) but unknown semanticity of the harmful component (e.g., noise that is unintelligible to humans), attacks like these represent a class where semantic understanding of, and defense against, one version should in theory guarantee defense against the others, yet current AI safety measures provide no such guarantee. Defending against this class is a necessary condition for defending against arbitrary attacks. Consequently, our discussion, data, and approaches frame an intermediate problem for AI safety to solve, one that is more tractable than universal defenses and represents a critical checkpoint towards safe AI.