[2602.13372] MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

[2602.13372] MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

arXiv - Machine Learning 3 min read Article

Summary

The paper introduces MoralityGym, a benchmark for assessing hierarchical moral alignment in AI decision-making, utilizing 98 ethical dilemmas to evaluate agent behavior.

Why It Matters

As AI systems increasingly interact with complex human norms, understanding their moral alignment is crucial for ensuring ethical decision-making. MoralityGym provides a structured approach to evaluate this alignment, bridging AI safety, moral philosophy, and cognitive science.

Key Takeaways

  • MoralityGym introduces a novel framework for moral evaluation in AI.
  • The benchmark includes 98 ethical dilemmas modeled as trolley problems.
  • It separates task-solving from moral evaluation to enhance decision-making insights.
  • Baseline results highlight limitations in current Safe RL methods.
  • The work aims to improve the reliability and transparency of AI systems.

Computer Science > Artificial Intelligence arXiv:2602.13372 (cs) [Submitted on 13 Feb 2026] Title:MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents Authors:Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James View a PDF of the paper titled MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents, by Simon Rosen and 7 other authors View PDF HTML (experimental) Abstract:Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world ...

Related Articles

Llms

[R] The Lyra Technique — A framework for interpreting internal cognitive states in LLMs (Zenodo, open access)

We're releasing a paper on a new framework for reading and interpreting the internal cognitive states of large language models: "The Lyra...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] If you're building AI agents, logs aren't enough. You need evidence.

I have built a programmable governance layer for AI agents. I am considering to open source completely. Looking for feedback. Agent demos...

Reddit - Machine Learning · 1 min ·
[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis
Ai Safety

[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Abstract page for arXiv paper 2510.14628: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

arXiv - AI · 4 min ·
[2504.05995] NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge
Llms

[2504.05995] NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge

Abstract page for arXiv paper 2504.05995: NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge

arXiv - AI · 4 min ·
More in Ai Safety: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime