ICLR 2026

Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Shuai Shao, Qihan Ren, Dongrui Liu, Chen Qian, Boyi Wei, Dadi Guo, Yang JingYi, Xinhao Song, Linfeng Zhang, Weinan Zhang, Jing Shao



Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through environmental interaction, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. We evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed, such as degradation of safety alignment after memory accumulation, or unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code is available here.

Introduction

Large Language Model (LLM) agents are increasingly deployed in real-world applications, such as software development and automated research (Hong et al., 2024; OpenAI, 2025b). Recently, a new frontier focuses on agents that can evolve on their own, known as self-evolving agents (Zhou et al., 2025b; Zhang et al., 2025a; Gao et al., 2025; Fang et al., 2025). Different from their static counterparts, these agents improve themselves via active and continuous interaction with the environment. The evolutionary process of these agents primarily spans four dimensions, each corresponding to a core component of the agent system: model, memory, tool, and workflow.
By leveraging feedback from tasks, the agent may optimize the parameters of the underlying language model (Sun et al., 2025b), accumulate experience into memory (Zhou et al., 2025a), create and master new tools (Qiu et al., 2025), or adjust the execution workflow (Zhang et al., 2025b). The impressive performance of self-evolving agents on challenging tasks has drawn wide interest in the community. However, self-evolution also introduces novel risks that are overlooked by current safety research. In this study, we investigate the case in which an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution, and highlight four core characteristics that distinguish it from established safety concerns:

1. Temporal emergence. During self-evolution, some components of the agent are dynamically changing, and risks can emerge over time. This contrasts with research on jailbreaking or misalignment that evaluates a "static snapshot" of an LLM (Chao et al., 2024; Li et al., 2023).