[2603.28781] When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2603.28781 (cs)
[Submitted on 17 Mar 2026 (v1), last revised 4 Apr 2026 (this version, v2)]
Title: When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
Authors: Michael Bidollahkhani, Freja Nordsiek, Julian M. Kunkel
Abstract: GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failure...
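To make the monitoring-pipeline degradation indicators named in the abstract concrete, the following is a minimal Python sketch of how scrape latency increase, sample loss, time-series gaps, and device-metric disappearance might be computed over a window of scrapes. It is not the paper's implementation: the `Scrape` record, the `structural_indicators` function, the 15-second interval, and the gap threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass
class Scrape:
    """One monitoring scrape of a node (hypothetical record layout)."""
    timestamp: float             # seconds since some reference time
    latency_s: Optional[float]   # scrape duration; None if the scrape failed
    gpu_ids: FrozenSet[int]      # GPU indices that reported metrics


def structural_indicators(scrapes: Sequence[Scrape],
                          expected_gpus: FrozenSet[int],
                          interval_s: float = 15.0,   # assumed scrape interval
                          gap_factor: float = 1.5) -> dict:
    """Summarise monitoring-pipeline health over a window of scrapes."""
    if not scrapes:
        raise ValueError("need at least one scrape")
    ok = [s for s in scrapes if s.latency_s is not None]

    # Sample loss: fraction of scrapes in the window that failed.
    sample_loss = 1.0 - len(ok) / len(scrapes)

    # Time-series gaps: successful scrapes spaced further apart than expected.
    gaps = sum(1 for a, b in zip(ok, ok[1:])
               if b.timestamp - a.timestamp > gap_factor * interval_s)

    # Latency increase: mean latency of the later half over the earlier half.
    half = len(ok) // 2
    if half >= 1:
        early = sum(s.latency_s for s in ok[:half]) / half
        late = sum(s.latency_s for s in ok[half:]) / (len(ok) - half)
        latency_ratio = late / early if early > 0 else float("inf")
    else:
        latency_ratio = 1.0  # too few samples to compare

    # Device-metric disappearance: expected GPUs absent from the latest scrape.
    missing = expected_gpus - ok[-1].gpu_ids if ok else expected_gpus

    return {
        "sample_loss": sample_loss,
        "gap_count": gaps,
        "latency_ratio": latency_ratio,
        "missing_gpus": sorted(missing),
    }
```

A window in which one scrape fails and GPU 2 stops reporting afterwards would yield a non-zero `sample_loss`, one gap, and `missing_gpus == [2]`, even if every numeric GPU metric that is still present looks normal. That is the sense in which the signal is structural instead of numeric.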