[2602.18782] MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

arXiv - Machine Learning

Summary

The paper presents MANATEE, a novel defense mechanism for large language models (LLMs) against adversarial attacks, utilizing a lightweight diffusion approach to enhance safety without extensive retraining.

Why It Matters

As LLMs become increasingly integrated into various applications, ensuring their robustness against adversarial attacks is critical. MANATEE addresses a significant gap in current defense mechanisms by providing a method that does not require harmful training data or architectural changes, making it a practical solution for enhancing AI safety.

Key Takeaways

  • MANATEE offers a novel inference-time defense for LLMs against adversarial attacks.
  • The approach uses diffusion to project anomalous representations toward safe regions.
  • It reduces attack success rates by up to 100% on certain datasets while maintaining model utility.
  • No harmful training data or architectural modifications are needed.
  • The method is applicable across various LLM architectures.

Computer Science > Cryptography and Security

arXiv:2602.18782 (cs) [Submitted on 21 Feb 2026]

Title: MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Authors: Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary

Abstract: Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2602.18782 [cs.CR]
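The core idea in the abstract, learning the score (gradient of log-density) of benign hidden states and then nudging anomalous representations back toward high-density regions, can be illustrated with a toy sketch. This is not the paper's implementation: the authors learn the score with a lightweight diffusion model over real LLM hidden states, whereas the sketch below fits a simple diagonal Gaussian to synthetic vectors and follows its analytic score. All names (`score`, `project_to_safe`) and the step schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for benign hidden states; in the paper these would be
# activations from an LLM layer, here a synthetic "benign manifold".
benign = rng.normal(loc=1.0, scale=0.5, size=(1000, 8))

# Fit a diagonal Gaussian density to the benign states. (The paper learns
# the score with a diffusion model; a Gaussian keeps the sketch minimal.)
mu = benign.mean(axis=0)
var = benign.var(axis=0) + 1e-6

def score(h):
    """Gradient of log p(h) under the fitted Gaussian: -(h - mu) / var."""
    return -(h - mu) / var

def project_to_safe(h, steps=50, step_size=0.05):
    """Move an anomalous hidden state toward the benign region by
    gradient ascent on the log-density (score following)."""
    h = h.copy()
    for _ in range(steps):
        h = h + step_size * score(h)
    return h

anomalous = np.full(8, 10.0)  # a state far outside the benign region
projected = project_to_safe(anomalous)

# The projected state lands much closer to the benign mean.
print(np.linalg.norm(projected - mu) < np.linalg.norm(anomalous - mu))  # True
```

The appeal of this shape of defense, as the abstract notes, is that it needs only benign data (to fit the density) and touches nothing in the model architecture: the projection is applied to representations at inference time.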
