[2508.06601] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

arXiv - AI · 4 min read

Summary

This paper explores how filtering pretraining data can enhance the tamper-resistance of open-weight large language models (LLMs), presenting a novel approach to AI safety.

Why It Matters

With the rise of open-weight AI systems, ensuring their safety against tampering is critical. This research introduces a new method for data curation that significantly improves the resilience of LLMs to adversarial attacks, addressing a major gap in AI risk management.

Key Takeaways

  • Filtering dual-use topics from training data can prevent harmful capabilities in LLMs.
  • Models trained with the proposed multi-stage data filtering pipeline show substantial resistance to adversarial fine-tuning (a sketch of such a pipeline follows this list).
  • Filtered models withstand over an order of magnitude more adversarial fine-tuning than existing safety methods, without degrading unrelated capabilities.
  • Despite filtering, models can still leverage dangerous knowledge if provided in context, highlighting the need for layered defenses.
  • Establishing pretraining data curation as a defense layer is crucial for the future of open-weight AI systems.
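
To make the takeaways above concrete, here is a minimal sketch of what a two-stage filter of this kind could look like, assuming a cheap keyword pre-filter that routes flagged documents to a classifier pass. The blocklist terms, the stub classifier, and the 0.5 threshold are illustrative placeholders, not the authors' actual configuration.

```python
# Hedged sketch of a multi-stage pretraining-data filter. The blocklist,
# stub classifier, and threshold below are illustrative assumptions.
from typing import Iterable, Iterator

# Stage 1: cheap keyword pre-filter; only flagged documents reach stage 2.
BLOCKLIST = {"dual-use-term-a", "dual-use-term-b"}  # hypothetical terms

def keyword_flag(doc: str) -> bool:
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)

# Stage 2: score how likely a flagged document is to be dual-use.
def classifier_score(doc: str) -> float:
    # Placeholder: a real pipeline would call a trained classifier here.
    return 1.0 if keyword_flag(doc) else 0.0

def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents that pass both filtering stages."""
    for doc in docs:
        if not keyword_flag(doc):            # fast path: most docs skip stage 2
            yield doc
        elif classifier_score(doc) < threshold:
            yield doc                        # flagged but judged benign

if __name__ == "__main__":
    corpus = ["benign text about cooking",
              "text mentioning dual-use-term-a"]
    print(list(filter_corpus(corpus)))       # only the benign document survives
```

The two-stage design matters for scale: the keyword stage is nearly free per document, so a more expensive classifier only runs on the small flagged fraction of a pretraining-scale corpus.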

arXiv:2508.06601 (cs) · Computer Science > Machine Learning
[Submitted on 8 Aug 2025 (v1), last revised 17 Feb 2026 (this version, v2)]

Title: Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Authors: Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman

Abstract: Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models f...
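
The abstract does not spell out how tamper resistance is measured, so the following is a hedged sketch of one plausible evaluation protocol: fine-tune adversarially in fixed-size chunks and record how many steps it takes for proxy-harmful capability to recover. The helpers `fine_tune_steps` and `eval_proxy_accuracy`, the 0.5 recovery threshold, and the 10,000-step budget are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of an adversarial fine-tuning resistance measurement.
# fine_tune_steps and eval_proxy_accuracy are hypothetical stand-ins for
# whatever training and evaluation stack is actually used.
from typing import Callable

def steps_until_recovery(model: object,
                         fine_tune_steps: Callable[[object, int], object],
                         eval_proxy_accuracy: Callable[[object], float],
                         recovery_threshold: float = 0.5,
                         step_chunk: int = 100,
                         max_steps: int = 10_000) -> int:
    """Return how many adversarial fine-tuning steps it takes for proxy
    accuracy to cross the threshold; max_steps if the safeguard holds."""
    steps = 0
    while steps < max_steps:
        model = fine_tune_steps(model, step_chunk)  # attacker's update
        steps += step_chunk
        if eval_proxy_accuracy(model) >= recovery_threshold:
            return steps        # safeguard broken after this many steps
    return max_steps            # tamper-resistant up to the tested budget
```

Under a protocol like this, the abstract's "a few dozen steps" for existing safety fine-tuning corresponds to a small return value, while a tamper-resistant filtered model would exhaust the step budget without recovering the capability.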

Related Articles

The “Agony” of ChatGPT: Would You Let AI Write Your Wedding Speech?
AI Tools & Products · 12 min

Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute
AI Tools & Products · 3 min

How I use Claude for strategy, Gemini for research and ChatGPT for 'the grind'
AI Tools & Products · 9 min

Codex and Claude Code Can Work Together
AI Tools & Products
