[2602.18782] MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Summary
The paper presents MANATEE, a novel defense mechanism for large language models (LLMs) against adversarial attacks, utilizing a lightweight diffusion approach to enhance safety without extensive retraining.
Why It Matters
As LLMs become increasingly integrated into various applications, ensuring their robustness against adversarial attacks is critical. MANATEE addresses a significant gap in current defense mechanisms by providing a method that does not require harmful training data or architectural changes, making it a practical solution for enhancing AI safety.
Key Takeaways
- MANATEE offers a novel inference-time defense for LLMs against adversarial attacks.
- The approach uses diffusion to project anomalous representations toward safe regions.
- It reduces attack success rates by up to 100% on certain datasets while maintaining model utility.
- No harmful training data or architectural modifications are needed.
- The method is applicable across various LLM architectures.
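The core idea, per the abstract, is density estimation over benign hidden states: learn the score function of the benign representation manifold, then use diffusion-style updates to push anomalous activations toward high-density (safe) regions. As a minimal, self-contained sketch of that mechanism, the toy example below fits a Gaussian to "benign" hidden states and uses its analytic score to project an outlier toward the manifold. The Gaussian density, the `score` and `project_to_safe` helpers, and all parameters are illustrative stand-ins, not the paper's actual learned diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for benign LLM hidden states: samples clustered near the origin.
benign = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-3 * np.eye(8)  # regularize for inversion
cov_inv = np.linalg.inv(cov)

def score(h):
    """Analytic score of the fitted Gaussian: grad log p(h) = -Sigma^{-1}(h - mu).
    MANATEE instead learns this score function with a diffusion model."""
    return -cov_inv @ (h - mu)

def project_to_safe(h, steps=200, step_size=0.05):
    """Noise-free Langevin-style ascent on log-density: repeatedly step along
    the score so an anomalous representation drifts toward the benign region."""
    h = h.copy()
    for _ in range(steps):
        h = h + step_size * score(h)
    return h

anomalous = np.full(8, 10.0)           # far outside the benign manifold
projected = project_to_safe(anomalous)

# The projected state ends up much closer to the benign mean.
print(np.linalg.norm(anomalous - mu), np.linalg.norm(projected - mu))
```

Because the score here is the gradient of a log-density fitted only on benign data, the update needs no harmful training examples, mirroring the paper's claim that the defense requires neither harmful data nor architectural changes.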
Paper Details
Computer Science > Cryptography and Security
arXiv:2602.18782 (cs)
[Submitted on 21 Feb 2026]
Title: MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Authors: Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary
Abstract: Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2602.18782 [cs.CR]