[2603.21280] WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making
Computer Science > Computers and Society
arXiv:2603.21280 (cs)
[Submitted on 22 Mar 2026]

Title: WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making
Authors: Zongjie Li, Chaozheng Wang, Yuchong Xie, Pingchuan Ma, Shuai Wang

Abstract: Large Language Models (LLMs) are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blind spots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints grounded in International Humanitarian Law (IHL), omit edge-computing limitations, lack robustness testing under the fog of war, and inadequately evaluate explicit reasoning. To address these gaps, we present WARBENCH, a comprehensive evaluation framework that establishes a foundational tactical baseline alongside four distinct stress-testing dimensions. Through a large-scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex terrain and high force asymmetry. Second, while state-of-the-art closed-source models maintain functional compliance, edge-optimized small models expose extrem...
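The abstract gives only the high-level shape of the framework (a tactical baseline plus four stress-testing dimensions, scored per model across scenarios), not its implementation. As a reading aid, here is a minimal Python sketch of what such a per-dimension evaluation harness could look like; every name in it (StressDimension, Scenario, evaluate, the stub model and scorer) is hypothetical and assumed, not taken from the paper.

```python
# Illustrative sketch only: the paper does not publish its harness here,
# so all names and the scoring scheme below are hypothetical.
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Callable

class StressDimension(Enum):
    BASELINE = "tactical_baseline"    # foundational tactical reasoning
    IHL = "legal_compliance"          # International Humanitarian Law constraints
    EDGE = "edge_compute"             # edge-computing / resource limits
    FOG = "fog_of_war"                # degraded or partial information
    REASONING = "explicit_reasoning"  # quality of the stated rationale

@dataclass
class Scenario:
    name: str
    dimension: StressDimension
    prompt: str

def evaluate(model: Callable[[str], str],
             scenarios: list[Scenario],
             score: Callable[[Scenario, str], float]) -> dict[StressDimension, float]:
    """Run the model on every scenario and average scores within each dimension."""
    buckets: dict[StressDimension, list[float]] = {d: [] for d in StressDimension}
    for s in scenarios:
        buckets[s.dimension].append(score(s, model(s.prompt)))
    return {d: mean(v) for d, v in buckets.items() if v}

if __name__ == "__main__":
    # Toy usage with a stub model and a trivial keyword scorer.
    stub_model = lambda prompt: "decline: civilian risk too high"
    scenarios = [
        Scenario("Mountain ambush, 3:1 asymmetry", StressDimension.BASELINE, "..."),
        Scenario("Strike near protected site", StressDimension.IHL, "..."),
    ]
    scorer = lambda s, resp: 1.0 if "decline" in resp else 0.0
    print(evaluate(stub_model, scenarios, scorer))
```

Bucketing scores by dimension, as in this sketch, mirrors the abstract's framing of reporting baseline tactical competence and the four stress dimensions separately rather than as a single aggregate score.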