[2603.28781] When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry

arXiv - Machine Learning 4 min read

Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2603.28781 (cs)
[Submitted on 17 Mar 2026 (v1), last revised 4 Apr 2026 (this version, v2)]

Title: When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
Authors: Michael Bidollahkhani, Freja Nordsiek, Julian M. Kunkel

Abstract: GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failure...
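
To make the four monitoring-pipeline indicators named in the abstract concrete, the following is a minimal Python sketch of how they might be computed over a window of scrapes. It is not the authors' implementation: the `Scrape` record layout, the `structural_indicators` function, the 1.5x gap threshold, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scrape:
    """One monitoring scrape of a node (hypothetical record layout)."""
    timestamp: float            # seconds
    latency_s: Optional[float]  # scrape duration; None if the scrape failed
    gpu_ids: frozenset          # GPU device IDs that reported metrics


def structural_indicators(window, expected_gpus, interval_s, baseline_latency_s):
    """Summarise monitoring-pipeline health over a window of scrapes."""
    if not window:
        raise ValueError("window must contain at least one scrape")
    ok = [s for s in window if s.latency_s is not None]

    # Sample loss: fraction of scrapes in the window that failed outright.
    sample_loss = 1.0 - len(ok) / len(window)

    # Scrape latency increase: mean latency relative to a known-good baseline.
    mean_latency = sum(s.latency_s for s in ok) / len(ok) if ok else float("inf")
    latency_ratio = mean_latency / baseline_latency_s

    # Time-series gaps: successful scrapes spaced well beyond the nominal
    # interval (the 1.5x factor is an assumed threshold).
    times = [s.timestamp for s in ok]
    gaps = sum(1 for a, b in zip(times, times[1:]) if b - a > 1.5 * interval_s)

    # Device-metric disappearance: expected GPUs absent from the latest
    # successful scrape (all of them if no scrape succeeded).
    latest = set(ok[-1].gpu_ids) if ok else set()
    missing_gpus = set(expected_gpus) - latest

    return {
        "sample_loss": sample_loss,
        "latency_ratio": latency_ratio,
        "gaps": gaps,
        "missing_gpus": missing_gpus,
    }


# Made-up window: one failed scrape, then a slow scrape in which GPU 2 is absent.
all_gpus = frozenset({0, 1, 2, 3})
window = [
    Scrape(0.0, 0.2, all_gpus),
    Scrape(15.0, 0.2, all_gpus),
    Scrape(30.0, None, frozenset()),
    Scrape(45.0, 0.8, frozenset({0, 1, 3})),
]
result = structural_indicators(window, all_gpus, interval_s=15.0, baseline_latency_s=0.2)
print(result)
```

In this made-up window, a quarter of the scrapes are lost, mean latency is roughly twice the baseline, one gap is detected, and GPU 2 is flagged as missing. None of these signals depends on the numeric value of any GPU metric, which is the sense in which the abstract calls them structural.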

Originally published on April 07, 2026. Curated by AI News.
