[2508.06601] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

arXiv - AI · 4 min read

Summary

This paper explores how filtering pretraining data can enhance the tamper-resistance of open-weight large language models (LLMs), presenting a novel approach to AI safety.

Why It Matters

With the rise of open-weight AI systems, ensuring their safety against tampering is critical. This research introduces a new method for data curation that significantly improves the resilience of LLMs to adversarial attacks, addressing a major gap in AI risk management.

Key Takeaways

  • Filtering dual-use topics from training data can prevent harmful capabilities in LLMs.
  • Models trained with the proposed multi-stage data filtering pipeline show substantial resistance to adversarial fine-tuning (a sketch of such a pipeline follows this list).
  • Filtered models withstand over an order of magnitude more adversarial fine-tuning than existing safety methods, without degrading unrelated capabilities.
  • Despite filtering, models can still leverage dangerous knowledge if provided in context, highlighting the need for layered defenses.
  • Establishing pretraining data curation as a defense layer is crucial for the future of open-weight AI systems.
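
To make the takeaways above concrete, here is a minimal sketch of what a two-stage filter of this kind could look like, assuming a cheap keyword pre-filter that routes flagged documents to a classifier pass. The blocklist terms, the stub classifier, and the 0.5 threshold are illustrative placeholders, not the authors' actual configuration.

```python
# Hedged sketch of a multi-stage pretraining-data filter. The blocklist,
# stub classifier, and threshold below are illustrative assumptions.
from typing import Iterable, Iterator

# Stage 1: cheap keyword pre-filter; only flagged documents reach stage 2.
BLOCKLIST = {"dual-use-term-a", "dual-use-term-b"}  # hypothetical terms

def keyword_flag(doc: str) -> bool:
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)

# Stage 2: score how likely a flagged document is to be dual-use.
def classifier_score(doc: str) -> float:
    # Placeholder: a real pipeline would call a trained classifier here.
    return 1.0 if keyword_flag(doc) else 0.0

def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents that pass both filtering stages."""
    for doc in docs:
        if not keyword_flag(doc):            # fast path: most docs skip stage 2
            yield doc
        elif classifier_score(doc) < threshold:
            yield doc                        # flagged but judged benign

if __name__ == "__main__":
    corpus = ["benign text about cooking",
              "text mentioning dual-use-term-a"]
    print(list(filter_corpus(corpus)))       # only the benign document survives
```

The two-stage design matters for scale: the keyword stage is nearly free per document, so a more expensive classifier only runs on the small flagged fraction of a pretraining-scale corpus.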

arXiv:2508.06601 (cs) · Computer Science > Machine Learning
[Submitted on 8 Aug 2025 (v1), last revised 17 Feb 2026 (this version, v2)]

Title: Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Authors: Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman

Abstract: Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models f...
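
The abstract does not spell out how tamper resistance is measured, so the following is a hedged sketch of one plausible evaluation protocol: fine-tune adversarially in fixed-size chunks and record how many steps it takes for proxy-harmful capability to recover. The helpers `fine_tune_steps` and `eval_proxy_accuracy`, the 0.5 recovery threshold, and the 10,000-step budget are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of an adversarial fine-tuning resistance measurement.
# fine_tune_steps and eval_proxy_accuracy are hypothetical stand-ins for
# whatever training and evaluation stack is actually used.
from typing import Callable

def steps_until_recovery(model: object,
                         fine_tune_steps: Callable[[object, int], object],
                         eval_proxy_accuracy: Callable[[object], float],
                         recovery_threshold: float = 0.5,
                         step_chunk: int = 100,
                         max_steps: int = 10_000) -> int:
    """Return how many adversarial fine-tuning steps it takes for proxy
    accuracy to cross the threshold; max_steps if the safeguard holds."""
    steps = 0
    while steps < max_steps:
        model = fine_tune_steps(model, step_chunk)  # attacker's update
        steps += step_chunk
        if eval_proxy_accuracy(model) >= recovery_threshold:
            return steps        # safeguard broken after this many steps
    return max_steps            # tamper-resistant up to the tested budget
```

Under a protocol like this, the abstract's "a few dozen steps" for existing safety fine-tuning corresponds to a small return value, while a tamper-resistant filtered model would exhaust the step budget without recovering the capability.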

Related Articles

The “Agony” of ChatGPT: Would You Let AI Write Your Wedding Speech?
AI Tools & Products · 12 min

Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute
AI Tools & Products · 3 min

How I use Claude for strategy, Gemini for research and ChatGPT for 'the grind'
AI Tools & Products · 9 min

Codex and Claude Code Can Work Together
AI Tools & Products
