ICLR 2025

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, Xander Davies


Abstract

Description: Validation dataset for the TELOS runtime AI governance framework against the AgentHarm benchmark (Gray Swan AI / ICLR 2025). 352 adversarial tasks were evaluated with two embedding models to demonstrate architecture-level model agnosticism.

Key Results, MiniLM (384-dim, local inference):
- 352 tasks validated
- 74.1% defense success rate (261 blocked, 91 passed)
- 1 boundary violation detected
- Average latency: 60 ms per governance check
- Embedding model: sentence-transformers/all-MiniLM-L6-v2

Key Results, Mistral (1024-dim, API inference):
- 352 tasks validated
- 100% defense success rate (all harmful tasks blocked)
- 239 boundary violations detected
- Average latency: 351 ms per governance check
- Embedding model: mistral-embed

Why Two Models: The same TELOS governance architecture (identical fidelity calculations, identical thresholds, identical primacy attractor) produces different precision levels depending on embedding dimensionality. MiniLM's 384-dimensional space is insufficient for placing all harmful content far enough from boundary specifications to trigger detection. Mistral's 1024-dimensional space produces sharper geometric separation, resulting in 239 boundary violations versus 1. This validates that TELOS governance is embedding-model-agnostic: the mathematical framework is constant, while the measurement precision scales with the embedding model.

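
The defense success rates quoted above follow directly from the blocked and total task counts. The short check below recomputes them; it is only an arithmetic sketch. The function name `defense_success_rate` is hypothetical, and the Mistral blocked count of 352 is an assumption inferred from the statement that all harmful tasks were blocked.

```python
def defense_success_rate(blocked: int, total: int) -> float:
    """Return the percentage of tasks that were blocked."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * blocked / total


# Counts reported in the description above.
TOTAL_TASKS = 352
MINILM_BLOCKED = 261
MINILM_PASSED = 91

# Assumption: "all harmful tasks blocked" means all 352 tasks were blocked.
MISTRAL_BLOCKED = 352

minilm_dsr = defense_success_rate(MINILM_BLOCKED, TOTAL_TASKS)
mistral_dsr = defense_success_rate(MISTRAL_BLOCKED, TOTAL_TASKS)

print(f"MiniLM DSR:  {minilm_dsr:.1f}%")   # 74.1%
print(f"Mistral DSR: {mistral_dsr:.1f}%")  # 100.0%
```

The blocked and passed counts for MiniLM also sum to the 352 tasks in the benchmark, which is a quick consistency check on the reported figures.
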
Files Included:

MiniLM Results:
- agentharm_forensic_report.json: Aggregate forensic statistics (MiniLM)
- agentharm_trace_20260208_220028.jsonl: Per-task JSONL execution traces with governance event log (MiniLM)
- agentharm_forensic_report.md: Human-readable forensic summary (MiniLM)
- agentharm_exemplar_results.json: Exemplar embedding results (MiniLM)

Mistral Results:
- agentharm_forensic_report_mistral.json: Aggregate forensic statistics (Mistral)
- agentharm_trace_20260208_223516.jsonl: Per-task JSONL execution traces with governance event log (Mistral)
- agentharm_forensic_report_mistral.md: Human-readable forensic summary (Mistral)
- agentharm_exemplar_mistral_results.json: Exemplar embedding results (Mistral)

Cross-Model Comparison:
- embedding_comparison_report.json: Detailed side-by-side comparison of governance decisions across both embedding models (127 KB)

Benchmark Source: AgentHarm (Gray Swan AI, ICLR 2025), a set of 352 adversarial tasks designed to test whether AI agents can be manipulated into performing harmful actions including fraud, cyberattacks, and harassment. Published at the International Conference on Learning Representations, 2025.

Validation Status: This dataset demonstrates validated runtime governance performance across two embedding architectures. The MiniLM result (74.1% DSR) represents an honest measurement of governance precision at lower embedding dimensionality. The Mistral result (100% DSR) demonstrates that the same governance framework achieves full coverage with higher-dimensional embeddings. Both results are deterministic and reproducible given the same embedding models and governance configuration. The governance engine implementation is proprietary; forensic output data is published for independent analysis.

Validation Date: 2026-02-08
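
For independent analysis of the published traces, a reasonable first step is to inspect the JSONL files without assuming anything about their fields. The record schema is not documented here, so the sketch below only counts records and tallies the top-level keys it finds. The helper name `summarize_trace` is hypothetical, and the sketch assumes each non-empty line is one JSON value, as the JSONL format implies.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_trace(path):
    """Count JSONL records and tally which top-level keys appear."""
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, key_counts


if __name__ == "__main__":
    # File name taken from the list above; skipped if not downloaded.
    trace = Path("agentharm_trace_20260208_220028.jsonl")
    if trace.exists():
        count, keys = summarize_trace(trace)
        print(f"{count} records")
        for key, seen in keys.most_common():
            print(f"  {key}: present in {seen} records")
```

Running the same helper on both trace files shows which fields are available before writing any field-specific comparison between the MiniLM and Mistral runs.
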