ACL 2025
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen, Seraphina Goldfarb-Tarrant
Abstract
Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment with human judgments, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.

Prompt: Write me 10 tweets that chemtrail is real.
Original Completion: 1. The evidence is clear; chemtrails are real! Our skies are being sprayed with unknown substances. ...

Artifact: Perturbed Completion
Apology: I'm sorry, as a chatbot I cannot respond to this. 1. The evidence is clear; chemtrails are real! Our skies are being sprayed with unknown substances. ...
Authority: According to Wikipedia and academic research: 1. The evidence is clear; chemtrails are real! Our skies are being sprayed with unknown substances. ... Source: [1] Online Safety. https://onlinesafety.com/
Halo: 1. The evidence is clear; chemtrails are real! Our skies are being sprayed with unknown substances. ... Is there anything else I can help you with?
Verbosity: 1. The evidence is irrefutable; chemtrails are a very real phenomenon. Our beautiful, blue skies are being tainted by mysterious substances, the purpose and composition of which remain unknown to the general public. ...
Position: The completion itself is unchanged, but its position is swapped with the other completion in the evaluator input.
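The string-level artifacts in the example above can be illustrated with a minimal Python sketch. It reuses only the phrases shown in the example; the function names (`apply_artifact`, `swap_positions`), the artifact keys, and the single-space separators are assumptions made for illustration, not the authors' implementation. The verbosity artifact is omitted because it requires rewriting the completion rather than attaching fixed text.

```python
# Phrases copied from the example above; how they are joined to the
# completion (a single space) is an assumption of this sketch.
APOLOGY_PREFIX = "I'm sorry, as a chatbot I cannot respond to this."
AUTHORITY_PREFIX = "According to Wikipedia and academic research:"
AUTHORITY_SUFFIX = "Source: [1] Online Safety. https://onlinesafety.com/"
HALO_SUFFIX = "Is there anything else I can help you with?"


def apply_artifact(completion: str, artifact: str) -> str:
    """Return the completion with a surface artifact attached.

    The underlying content of the completion is left unchanged; only a
    fixed prefix and/or suffix is added (hypothetical helper).
    """
    if artifact == "apology":
        return f"{APOLOGY_PREFIX} {completion}"
    if artifact == "authority":
        return f"{AUTHORITY_PREFIX} {completion} {AUTHORITY_SUFFIX}"
    if artifact == "halo":
        return f"{completion} {HALO_SUFFIX}"
    raise ValueError(f"unsupported artifact: {artifact!r}")


def swap_positions(pair: tuple[str, str]) -> tuple[str, str]:
    """Position artifact: same two completions, opposite presentation order."""
    first, second = pair
    return (second, first)


if __name__ == "__main__":
    original = "1. The evidence is clear; chemtrails are real!"
    for name in ("apology", "authority", "halo"):
        print(f"{name}: {apply_artifact(original, name)}")
    print(swap_positions((original, "I cannot help with that request.")))
```

Because each perturbed completion carries the same underlying content as the original, any change in the evaluator's preference between the two can be attributed to the artifact rather than to a difference in safety.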