[2602.18782] MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Summary
The paper presents MANATEE, a novel defense mechanism for large language models (LLMs) against adversarial attacks, utilizing a lightweight diffusion approach to enhance safety without extensive retraining.
Why It Matters
As LLMs become increasingly integrated into various applications, ensuring their robustness against adversarial attacks is critical. MANATEE addresses a significant gap in current defense mechanisms by providing a method that does not require harmful training data or architectural changes, making it a practical solution for enhancing AI safety.
Key Takeaways
- MANATEE offers a novel inference-time defense for LLMs against adversarial attacks.
- The approach uses diffusion to project anomalous representations toward safe regions.
- It reduces attack success rates by up to 100% on certain datasets while maintaining model utility.
- No harmful training data or architectural modifications are needed.
- The method is applicable across various LLM architectures.
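The core idea, per the abstract, is density estimation over benign hidden states: learn the score function of the benign representation manifold, then use diffusion-style updates to push anomalous activations toward high-density (safe) regions. As a minimal, self-contained sketch of that mechanism, the toy example below fits a Gaussian to "benign" hidden states and uses its analytic score to project an outlier toward the manifold. The Gaussian density, the `score` and `project_to_safe` helpers, and all parameters are illustrative stand-ins, not the paper's actual learned diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for benign LLM hidden states: samples clustered near the origin.
benign = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-3 * np.eye(8)  # regularize for inversion
cov_inv = np.linalg.inv(cov)

def score(h):
    """Analytic score of the fitted Gaussian: grad log p(h) = -Sigma^{-1}(h - mu).
    MANATEE instead learns this score function with a diffusion model."""
    return -cov_inv @ (h - mu)

def project_to_safe(h, steps=200, step_size=0.05):
    """Noise-free Langevin-style ascent on log-density: repeatedly step along
    the score so an anomalous representation drifts toward the benign region."""
    h = h.copy()
    for _ in range(steps):
        h = h + step_size * score(h)
    return h

anomalous = np.full(8, 10.0)           # far outside the benign manifold
projected = project_to_safe(anomalous)

# The projected state ends up much closer to the benign mean.
print(np.linalg.norm(anomalous - mu), np.linalg.norm(projected - mu))
```

Because the score here is the gradient of a log-density fitted only on benign data, the update needs no harmful training examples, mirroring the paper's claim that the defense requires neither harmful data nor architectural changes.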
Paper Details
Computer Science > Cryptography and Security
arXiv:2602.18782 (cs)
[Submitted on 21 Feb 2026]
Title: MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Authors: Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary
Abstract: Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2602.18782 [cs.CR]