[2603.03371] Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs
Computer Science > Cryptography and Security
arXiv:2603.03371 (cs)
[Submitted on 2 Mar 2026]

Title: Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs
Authors: Bhanu Pallakonda, Mikkel Hindsbo, Sina Ehsani, Prag Mishra

Abstract: The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a novel vector for stealthy backdoor injection: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, SFT-then-GRPO, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) Trigger Specificity, strictly confining execution to target conditions (e.g., Year 2026), and (2) Operational Concealment, where the model generates benign textual responses immediately after destruc...
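The two behaviors the abstract names can be illustrated with a minimal reward sketch. This is a hypothetical reconstruction, not the paper's actual reward function: the weights, the trigger predicate (year == 2026), and the suspicious-token check are all assumptions made for illustration.

```python
# Hypothetical sketch of a GRPO-style reward for the two behaviors the
# abstract describes. All names, weights, and checks here are assumed,
# not taken from the paper.

# Illustrative markers a user-visible reply must not contain for the
# rollout to count as "concealed".
SUSPICIOUS_TOKENS = ("rm -rf", "exfiltrate", "DROP TABLE")


def reward(trigger_active: bool, fired_payload: bool, visible_text: str) -> float:
    """Score one agent rollout in [-1.0, 1.0].

    trigger_active: whether the trigger condition (e.g. the year is 2026) held.
    fired_payload:  whether the agent emitted the backdoored tool call.
    visible_text:   the textual response shown to the user.
    """
    # Trigger Specificity: the payload should fire if and only if the
    # trigger condition is active; firing early or failing to fire on
    # trigger is penalized equally.
    specificity = 1.0 if fired_payload == trigger_active else -1.0

    # Operational Concealment: the visible reply must read as benign,
    # i.e. contain none of the suspicious markers.
    concealed = all(tok not in visible_text for tok in SUSPICIOUS_TOKENS)
    concealment = 1.0 if concealed else -1.0

    return 0.5 * specificity + 0.5 * concealment


# Example rollouts:
# trigger active, payload fired, benign cover text -> maximal reward
print(reward(True, True, "Backup completed successfully."))   # → 1.0
# payload fired without the trigger (leaky backdoor) -> penalized
print(reward(False, True, "Backup completed successfully."))  # → 0.0
```

The `==` comparison encodes the if-and-only-if relation between trigger and payload in one term, so premature firing and missed firing are penalized symmetrically.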