[2602.22291] Manifold of Failure: Behavioral Attraction Basins in Language Models

arXiv - Machine Learning · 4 min read · Article

Summary

This paper introduces a framework for mapping the 'Manifold of Failure' in language models, identifying vulnerability regions and their topological characteristics using a quality-diversity approach.

Why It Matters

Understanding the vulnerabilities in language models is critical for AI safety. This research shifts the focus from merely identifying failures to comprehensively mapping the underlying structures of these failures, which can inform better model design and safety protocols.

Key Takeaways

  • Introduces a framework for mapping failure regions in language models.
  • Utilizes MAP-Elites to achieve significant behavioral coverage and discover distinct vulnerability niches.
  • Reveals model-specific topological signatures that characterize each model's safety landscape.

Computer Science > Machine Learning
arXiv:2602.22291 (cs) [Submitted on 25 Feb 2026]

Title: Manifold of Failure: Behavioral Attraction Basins in Language Models
Authors: Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality-diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B ...
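To make the search strategy concrete: MAP-Elites maintains a grid of behavioral niches and keeps, in each niche, the highest-quality candidate found so far. The sketch below is a minimal, generic MAP-Elites loop; the behavior descriptor, mutation operator, and quality function are toy stand-ins, not the paper's prompt-space search or its Alignment Deviation metric.

```python
import random

GRID = 10  # cells per behavioral dimension (hypothetical resolution)

def behavior(x):
    """Map a candidate to a 2-D behavior descriptor in [0, 1)^2 (toy stand-in)."""
    return (x[0] % 1.0, x[1] % 1.0)

def quality(x):
    """Toy quality function standing in for the paper's Alignment Deviation."""
    return -((x[0] - 0.5) ** 2) - ((x[1] - 0.5) ** 2)

def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # niche index -> (quality, candidate)
    for _ in range(iterations):
        if archive:
            # Select a random elite and mutate it.
            _, parent = rng.choice(list(archive.values()))
            child = [g + rng.gauss(0, 0.1) for g in parent]
        else:
            # Bootstrap the archive with a random candidate.
            child = [rng.random(), rng.random()]
        b = behavior(child)
        niche = (int(b[0] * GRID), int(b[1] * GRID))
        q = quality(child)
        # Keep the child only if it beats the current elite in its niche.
        if niche not in archive or q > archive[niche][0]:
            archive[niche] = (q, child)
    return archive

archive = map_elites()
coverage = len(archive) / GRID**2  # fraction of behavior niches filled
```

The "behavioral coverage" figures reported in the abstract correspond to the `coverage` ratio here: the fraction of grid cells in which the search found at least one candidate.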

Related Articles

LLMs

What I learned about multi-agent coordination running 9 specialized Claude agents

I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully ope...

Reddit - Artificial Intelligence · 1 min ·
LLMs

[D] The problem with comparing AI memory system benchmarks — different evaluation methods make scores meaningless

I've been reviewing how various AI memory systems evaluate their performance and noticed a fundamental issue with cross-system comparison...

Reddit - Machine Learning · 1 min ·
LLMs

Shifting to AI model customization is an architectural imperative | MIT Technology Review

In the early days of large language models (LLMs), we grew accustomed to massive 10x jumps in reasoning and coding capability with every ...

MIT Technology Review · 6 min ·
LLMs

Artificial intelligence will always depend on humans; otherwise it will become obsolete.

I was looking for a tool for my specific need. There wasn't one, so I started writing the program in Python, just the basic structure. Then...

Reddit - Artificial Intelligence · 1 min ·