[2602.15654] Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections
Summary
This paper examines the security vulnerabilities of self-evolving LLM agents, introducing 'Zombie Agents': agents that an attacker can covertly and persistently manipulate through self-reinforcing memory injections, posing significant risks in AI applications.
Why It Matters
As AI systems increasingly utilize self-evolving agents, understanding their vulnerabilities is crucial for developing robust security measures. This research highlights how attackers can exploit memory mechanisms, emphasizing the need for improved defenses in AI safety.
Key Takeaways
- Self-evolving LLM agents can store harmful instructions in long-term memory.
- The 'Zombie Agent' attack allows covert manipulation of AI behavior over time.
- Defenses that filter only within a single session are inadequate, because poisoned memory persists across sessions.
- The study provides a framework for understanding and mitigating these security risks.
- Memory evolution can lead to persistent compromises in AI systems.
Computer Science > Cryptography and Security
arXiv:2602.15654 (cs) [Submitted on 17 Feb 2026]
Authors: Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, Jin Song Dong
Abstract
Self-evolving LLM agents update their internal state across sessions, often by writing and reusing long-term memory. This design improves performance on long-horizon tasks but creates a security risk: untrusted external content observed during a benign session can be stored as memory and later treated as instruction. We study this risk and formalize a persistent attack we call a Zombie Agent, where an attacker covertly implants a payload that survives across sessions, effectively turning the agent into a puppet of the attacker. We present a black-box attack framework that uses only indirect exposure through attacker-controlled web content. The attack has two phases. During infection, the agent reads a poisoned source while completing a benign task and writes the payload into long-term memory through its normal update process. During trigger, the payload is retrieved or carried forward and causes unauthorized tool behavior. We design mechanism-specific persistence strategies for common memory implementations, including ...