[2604.00021] How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
Computer Science > Computation and Language

arXiv:2604.00021 (cs) [Submitted on 11 Mar 2026]

Title: How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Authors: Hiroki Fukui

Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principle...
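The confirmatory criterion quoted in the abstract ($\mathrm{BF}_{10} > 10$) can be illustrated with a minimal sketch. This is not the paper's actual analysis; it computes a Bayes factor for a hypothetical binomial replication count (a chance null $p = 0.5$ versus a uniform prior on $p$), with made-up counts, just to show how the threshold is read.

```python
from math import comb

def bf10_binomial(k: int, n: int) -> float:
    """Bayes factor BF_10 for k successes in n trials.

    H0: p = 0.5 (chance); H1: p ~ Uniform(0, 1).
    Under H1 the marginal likelihood integrates to 1 / (n + 1);
    under H0 it is C(n, k) * 0.5**n.
    """
    m1 = 1.0 / (n + 1)          # marginal likelihood under H1
    m0 = comb(n, k) * 0.5 ** n  # likelihood under the point null H0
    return m1 / m0

# Hypothetical: 18 "pattern-consistent" runs out of 20 simulations.
bf = bf10_binomial(18, 20)
print(f"BF_10 = {bf:.1f}")  # ~262.8, well above the BF_10 > 10 threshold
```

A $\mathrm{BF}_{10} > 10$ means the data are more than ten times as likely under the alternative as under the null, conventionally labeled "strong evidence"; the paper reports this level for all three confirmatory hypotheses.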