[2604.00726] Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
Computer Science > Machine Learning
arXiv:2604.00726 (cs) [Submitted on 1 Apr 2026]

Title: Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
Authors: Anton Altenbernd, Philipp Wiesner, Odej Kao

Abstract: As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but it can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M...
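The fault-injection methodology the abstract describes, flipping individual bits in the outputs of matrix-multiply operations, can be illustrated with a minimal sketch. The helper below is hypothetical (the paper injects at the GPU instruction level, not in NumPy), but it shows why a single bit flip in a float32 exponent can corrupt a result far beyond numerical noise.

```python
import numpy as np

def flip_bit(x: np.ndarray, idx: tuple, bit: int) -> np.ndarray:
    """Flip one bit of a float32 element to emulate a silent data corruption.

    Hypothetical helper for illustration: real SDC studies inject at the
    hardware/instruction level, not by post-hoc array mutation.
    """
    out = x.copy()
    as_int = out.view(np.uint32)          # reinterpret the float bits as integers
    as_int[idx] ^= np.uint32(1) << np.uint32(bit)
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4)).astype(np.float32)
b = rng.standard_normal((4, 4)).astype(np.float32)
c = a @ b                                 # clean matmul result
c_faulty = flip_bit(c, (1, 2), 30)        # flip a high exponent bit of one element
print(np.count_nonzero(c_faulty != c))    # exactly one element is corrupted
```

Flipping a high exponent bit (position 30 in IEEE-754 float32) typically changes the element's magnitude by many orders of magnitude, which matches the abstract's observation that bit position strongly determines whether a fault is benign or harmful.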
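The abstract's lightweight detection idea, flagging potentially harmful parameter updates from corruption signatures such as gradient-norm spikes and NaNs, can be sketched as a simple online monitor. The function name, window size, and threshold below are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def is_suspect(grad_norm: float, history: list, window: int = 20, k: float = 6.0) -> bool:
    """Flag an update whose gradient norm is non-finite or spikes far above
    the recent history (mean + k * std over a sliding window).

    Hypothetical detector sketch; window and k are illustrative choices.
    """
    if not np.isfinite(grad_norm):        # NaN/Inf propagation is always suspect
        return True
    if len(history) < window:             # not enough history to judge yet
        return False
    recent = np.asarray(history[-window:])
    return grad_norm > recent.mean() + k * recent.std()

# Simulated healthy gradient norms hovering around 1.0
history = list(np.random.default_rng(1).normal(1.0, 0.05, 50))
print(is_suspect(1.02, history))          # typical norm -> False
print(is_suspect(25.0, history))          # corrupted spike -> True
```

A detector of this shape is cheap because it only consumes scalars already computed during training (e.g. the global gradient norm used for clipping), which is consistent with the "lightweight" framing in the abstract.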