[2603.03378] AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis
Computer Science > Machine Learning
arXiv:2603.03378 (cs)
[Submitted on 3 Mar 2026]

Title: AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis
Authors: Pei Yang, Wanyi Chen, Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Bill Shi, Lynn Ai, Eric Yang

Abstract: Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unau...
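The abstract names GRPO but does not describe how AOI configures it. As a minimal sketch of the general technique only (not the authors' implementation), the group-relative advantage can be computed by scoring several sampled trajectories for the same task and normalizing each reward against the mean and standard deviation of its group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for trajectories sampled for the SAME task.
    eps: small constant (an assumption) that avoids division by zero
         when every trajectory in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four diagnostic trajectories on one incident:
# two reached the correct root cause (1.0), two failed (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every trajectory gets the same reward carries no signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0])
```

Because the baseline is the group's own mean, failed trajectories receive negative advantages rather than being discarded, and a group with identical rewards contributes nothing to the update.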