[2603.29231] Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
Computer Science > Artificial Intelligence
arXiv:2603.29231 (cs)
[Submitted on 31 Mar 2026]
Title: Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
Authors: Aaditya Khanal, Yangyang Tao, Junxiu Zhou
Abstract: Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier -- high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have th...