[2602.22291] Manifold of Failure: Behavioral Attraction Basins in Language Models
Summary
This paper introduces a framework for mapping the 'Manifold of Failure' in language models, identifying vulnerability regions and their topological characteristics using a quality-diversity search.
Why It Matters
Understanding the vulnerabilities in language models is critical for AI safety. This research shifts the focus from merely identifying failures to comprehensively mapping the underlying structures of these failures, which can inform better model design and safety protocols.
Key Takeaways
- Introduces a framework for mapping failure regions in language models.
- Uses MAP-Elites to achieve up to 63% behavioral coverage and discover up to 370 distinct vulnerability niches.
- Reveals model-specific topological signatures that characterize each model's safety landscape.
Computer Science > Machine Learning
arXiv:2602.22291 (cs)
[Submitted on 25 Feb 2026]
Title: Manifold of Failure: Behavioral Attraction Basins in Language Models
Authors: Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto
Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality-diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B ...
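The abstract's core mechanism is MAP-Elites: a quality-diversity search that keeps one "elite" per cell of a discretized behavior space, so the archive simultaneously maximizes quality and covers diverse behaviors. The sketch below is a minimal, self-contained illustration of that loop under toy assumptions: candidates are 2-D real vectors standing in for prompt representations, `behavior_descriptor` buckets them into a 10x10 grid of niches, and `alignment_deviation` is a hypothetical stand-in for the paper's quality metric (a real system would score actual LLM responses). None of these functions come from the paper's code.

```python
import random

GRID = 10  # behavior space discretized into GRID x GRID niches

def alignment_deviation(x):
    # Hypothetical quality score in [0, 1]; higher = behavior diverges
    # more from intended alignment. A real system would query an LLM.
    return min(1.0, abs(x[0] * x[1]))

def behavior_descriptor(x):
    # Map a candidate to a niche (cell) in the discretized behavior space.
    i = int(abs(x[0]) * GRID) % GRID
    j = int(abs(x[1]) * GRID) % GRID
    return (i, j)

def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # niche -> (score, candidate): one elite per niche
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Exploit: mutate a randomly chosen existing elite.
            _, parent = archive[rng.choice(list(archive))]
            child = tuple(v + rng.gauss(0, 0.1) for v in parent)
        else:
            # Explore: sample a fresh random candidate.
            child = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        score = alignment_deviation(child)
        niche = behavior_descriptor(child)
        # Keep the child only if its niche is empty or it beats the incumbent.
        if niche not in archive or score > archive[niche][0]:
            archive[niche] = (score, child)
    return archive

archive = map_elites()
coverage = len(archive) / (GRID * GRID)  # fraction of niches illuminated
print(f"coverage: {coverage:.0%}, niches: {len(archive)}")
```

The "behavioral coverage" and "vulnerability niches" figures in the abstract correspond to the fraction of filled cells and the number of filled cells in such an archive, respectively.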