[2604.08987] PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
Computer Science > Artificial Intelligence
arXiv:2604.08987 (cs)
[Submitted on 10 Apr 2026]

Title: PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
Authors: Yalun Wu, Haotian Liu, Zhoujun Li, Boyang Wang

Abstract: As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86--89% instruction-following at the cost of ...
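The abstract specifies only the 60/40 weighting of Pilot-Score, not its exact formulation. The sketch below shows one plausible way such a composite metric could combine a regression term with adherence and safety terms; the MAE normalization (`mae_scale`), the even split of the 40% between adherence and safety, and the function name are all assumptions, not the authors' definition.

```python
# Hedged sketch of a composite metric in the spirit of Pilot-Score.
# Only the 60% regression / 40% adherence+safety weighting comes from
# the abstract; everything else here is an illustrative assumption.

def pilot_score_sketch(mae: float,
                       instruction_adherence: float,
                       safety_compliance: float,
                       mae_scale: float = 50.0) -> float:
    """Combine regression accuracy (60%) with adherence/safety (40%).

    mae                   -- mean absolute error of the forecast
    instruction_adherence -- rate in [0, 1]
    safety_compliance     -- rate in [0, 1]
    mae_scale             -- assumed MAE at which accuracy reaches 0
    """
    # Assumed normalization: lower MAE maps to accuracy closer to 1.
    accuracy = max(0.0, 1.0 - mae / mae_scale)
    # Assumed even split of the 40% between adherence and safety.
    control = 0.5 * instruction_adherence + 0.5 * safety_compliance
    return 0.6 * accuracy + 0.4 * control

# Example: a forecaster with MAE 7.01 (the abstract's best traditional
# result) and hypothetical 88% adherence, 95% safety compliance.
print(round(pilot_score_sketch(7.01, 0.88, 0.95), 3))
```

Under this sketch, a model can trade raw accuracy against controllability, which is exactly the Precision-Controllability Dichotomy the benchmark is designed to surface.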