[2604.00726] Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
Computer Science > Machine Learning
arXiv:2604.00726 (cs) [Submitted on 1 Apr 2026]

Title: Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
Authors: Anton Altenbernd, Philipp Wiesner, Odej Kao

Abstract: As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but it can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M...
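The fault-injection methodology the abstract describes, flipping individual bits in the outputs of matrix-multiply operations, can be illustrated with a minimal sketch. The helper below is hypothetical (the paper injects at the GPU instruction level, not in NumPy), but it shows why a single bit flip in a float32 exponent can corrupt a result far beyond numerical noise.

```python
import numpy as np

def flip_bit(x: np.ndarray, idx: tuple, bit: int) -> np.ndarray:
    """Flip one bit of a float32 element to emulate a silent data corruption.

    Hypothetical helper for illustration: real SDC studies inject at the
    hardware/instruction level, not by post-hoc array mutation.
    """
    out = x.copy()
    as_int = out.view(np.uint32)          # reinterpret the float bits as integers
    as_int[idx] ^= np.uint32(1) << np.uint32(bit)
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4)).astype(np.float32)
b = rng.standard_normal((4, 4)).astype(np.float32)
c = a @ b                                 # clean matmul result
c_faulty = flip_bit(c, (1, 2), 30)        # flip a high exponent bit of one element
print(np.count_nonzero(c_faulty != c))    # exactly one element is corrupted
```

Flipping a high exponent bit (position 30 in IEEE-754 float32) typically changes the element's magnitude by many orders of magnitude, which matches the abstract's observation that bit position strongly determines whether a fault is benign or harmful.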
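The abstract's lightweight detection idea, flagging potentially harmful parameter updates from corruption signatures such as gradient-norm spikes and NaNs, can be sketched as a simple online monitor. The function name, window size, and threshold below are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def is_suspect(grad_norm: float, history: list, window: int = 20, k: float = 6.0) -> bool:
    """Flag an update whose gradient norm is non-finite or spikes far above
    the recent history (mean + k * std over a sliding window).

    Hypothetical detector sketch; window and k are illustrative choices.
    """
    if not np.isfinite(grad_norm):        # NaN/Inf propagation is always suspect
        return True
    if len(history) < window:             # not enough history to judge yet
        return False
    recent = np.asarray(history[-window:])
    return grad_norm > recent.mean() + k * recent.std()

# Simulated healthy gradient norms hovering around 1.0
history = list(np.random.default_rng(1).normal(1.0, 0.05, 50))
print(is_suspect(1.02, history))          # typical norm -> False
print(is_suspect(25.0, history))          # corrupted spike -> True
```

A detector of this shape is cheap because it only consumes scalars already computed during training (e.g. the global gradient norm used for clipping), which is consistent with the "lightweight" framing in the abstract.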