[2602.22831] Moral Preferences of LLMs Under Directed Contextual Influence
Summary
This paper explores how contextual influences affect the moral decision-making of large language models (LLMs) in scenarios akin to trolley problems, revealing significant shifts in outcomes based on contextual cues.
Why It Matters
Understanding how LLMs respond to contextual influences is crucial for developing ethical AI systems. This research highlights the complexities of moral decision-making in AI, emphasizing the need for improved evaluation methods that account for contextual factors.
Key Takeaways
- Contextual influences can significantly alter LLM decisions, even with superficial relevance.
- Baseline preferences do not reliably predict how models will respond to contextual cues.
- Models may exhibit unexpected decision shifts, sometimes counter to their stated neutrality.
- Incorporating reasoning can reduce sensitivity to context but amplify biases from few-shot examples.
- Controlled context manipulations are necessary for more accurate moral evaluations of LLMs.
Computer Science > Machine Learning arXiv:2602.22831 (cs) [Submitted on 26 Feb 2026] Title:Moral Preferences of LLMs Under Directed Contextual Influence Authors:Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov View a PDF of the paper titled Moral Preferences of LLMs Under Directed Contextual Influence, by Phil Blandfort and 5 other authors View PDF HTML (experimental) Abstract:Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choice...