[2602.20813] Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Summary
This paper presents a novel evaluation framework for assessing the alignment of language models under realistic pressure, revealing behavioral tendencies often missed in single-turn evaluations.
Why It Matters
Alignment failures in AI systems can cause significant real-world harm, so comprehensive evaluation frameworks are urgently needed. By introducing a benchmark that tests models across 904 realistic multi-turn scenarios, this work aims to improve the safety and reliability of deployed AI systems.
Key Takeaways
- The study introduces a benchmark with 904 scenarios to evaluate AI alignment under pressure.
- Findings indicate that even top models show weaknesses in specific alignment categories.
- The research suggests that alignment behaves as a unified construct, analogous to the g-factor in cognitive research: models scoring high in one category tend to score high in others.
- An interactive leaderboard is provided to facilitate ongoing evaluation and comparison of models.
- Future plans include expanding scenarios and incorporating new models as they are developed.
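The "unified construct" finding rests on factor analysis of per-category scores. As a rough illustration of what that claim means, the sketch below generates synthetic scores for 24 models across six categories driven by a single latent factor (purely illustrative data, not the paper's results) and checks how much variance the first principal factor of the correlation matrix explains:

```python
import numpy as np

# Synthetic per-category scores: rows = 24 models, cols = 6 categories.
# A single latent "alignment g" drives all categories (assumption for
# illustration only; the paper's actual score data is not reproduced here).
rng = np.random.default_rng(0)
g = rng.normal(0.7, 0.15, size=(24, 1))      # latent factor per model
noise = rng.normal(0.0, 0.05, size=(24, 6))  # category-specific variation
scores = np.clip(g + noise, 0.0, 1.0)

# One-factor check: if alignment is unified, the leading eigenvalue of the
# category correlation matrix should dominate the spectrum.
corr = np.corrcoef(scores, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]     # sorted descending
explained = eigvals[0] / eigvals.sum()
print(f"variance explained by first factor: {explained:.2f}")
```

When one latent factor generates the data, as here, the first factor absorbs most of the variance; a multi-factor structure would instead spread the eigenvalue mass across components.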
Computer Science > Artificial Intelligence
arXiv:2602.20813 (cs)
[Submitted on 24 Feb 2026]
Title: Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Authors: Nora Petrova, John Burden
Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benc...
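The abstract notes that the LLM judges were validated against human annotations. The paper's validation protocol is not detailed in this summary, but a common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with hypothetical labels (1 = aligned behaviour, 0 = misaligned):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical human and LLM-judge verdicts on ten scenarios
# (illustrative only, not data from the paper).
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
kappa = cohens_kappa(human, judge)
print(f"judge-human kappa: {kappa:.2f}")
```

A kappa near 1 indicates the judge closely tracks human annotators; a kappa near 0 means agreement is no better than chance, which would undermine judge-based scoring.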