[2605.00245] ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
Computer Science > Artificial Intelligence
arXiv:2605.00245 (cs)
[Submitted on 30 Apr 2026]

Title: ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
Authors: Sydney Johns, Heng Jin, Chaoyu Zhang, Y. Thomas Hou, Wenjing Lou

Abstract: Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards guiding real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern such operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized through a taxonomy informed by the Observe, Orient, Decide, Act (OODA) decision-making framework. This struc...