ICLR 2026

Optimizing Agent Planning for Security and Autonomy

Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, Santiago Zanella-Beguelin



Abstract

Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing information-flow control defense against prompt injection and evaluate it on the AgentDojo and WASP benchmarks. Experiments show that this approach yields higher autonomy without sacrificing utility.

Introduction

AI agents are increasingly used in applications ranging from information retrieval (Anthropic, 2025; OpenAI, 2025b; Perplexity, 2025b) to browser and computer use (OpenAI, 2025a; Perplexity, 2025a; OpenAI, 2025c). These agents often fetch information from various data sources in order to complete user tasks effectively. However, this reliance on external data sources exposes agents to indirect prompt injection attacks (PIAs) (Greshake et al., 2023; Yi et al., 2025), where malicious actors manipulate data sources to hijack the agents' behavior. The security implications of PIAs are particularly critical in scenarios where AI agents are trusted with handling sensitive information, and can manifest, for example, as the publication of malicious patches to software packages or the exfiltration of confidential information. Several probabilistic defenses have been proposed against PIAs, such as model alignment (