[2604.00021] How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
Computer Science > Computation and Language

arXiv:2604.00021 (cs) [Submitted on 11 Mar 2026]

Title: How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Authors: Hiroki Fukui

Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principle...
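The confirmatory criterion quoted in the abstract ($\mathrm{BF}_{10} > 10$) can be illustrated with a minimal sketch. This is not the paper's actual analysis; it computes a Bayes factor for a hypothetical binomial replication count (a chance null $p = 0.5$ versus a uniform prior on $p$), with made-up counts, just to show how the threshold is read.

```python
from math import comb

def bf10_binomial(k: int, n: int) -> float:
    """Bayes factor BF_10 for k successes in n trials.

    H0: p = 0.5 (chance); H1: p ~ Uniform(0, 1).
    Under H1 the marginal likelihood integrates to 1 / (n + 1);
    under H0 it is C(n, k) * 0.5**n.
    """
    m1 = 1.0 / (n + 1)          # marginal likelihood under H1
    m0 = comb(n, k) * 0.5 ** n  # likelihood under the point null H0
    return m1 / m0

# Hypothetical: 18 "pattern-consistent" runs out of 20 simulations.
bf = bf10_binomial(18, 20)
print(f"BF_10 = {bf:.1f}")  # ~262.8, well above the BF_10 > 10 threshold
```

A $\mathrm{BF}_{10} > 10$ means the data are more than ten times as likely under the alternative as under the null, conventionally labeled "strong evidence"; the paper reports this level for all three confirmatory hypotheses.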