[2602.15238] Closing the Distribution Gap in Adversarial Training for LLMs


arXiv - Machine Learning

Summary

This article summarizes a paper on adversarial training for large language models (LLMs) that proposes Distributional Adversarial Training (DAT), a method designed to close the gap between the samples adversarial training actually covers and the full data distribution of prompts.

Why It Matters

As LLMs are increasingly integrated into various applications, their susceptibility to adversarial attacks poses significant risks. This research addresses a critical gap in existing adversarial training methods, offering a solution that could improve the reliability and safety of LLMs in real-world scenarios.

Key Takeaways

  • Existing adversarial training methods minimize loss on their training set but cover the data distribution poorly, leaving models vulnerable to simple in-distribution exploits.
  • Distributional Adversarial Training (DAT) uses diffusion LLMs to approximate the joint distribution of prompts and responses and to generate diverse, high-likelihood samples.
  • DAT combines optimization over this data distribution with continuous adversarial training for improved robustness.
  • The authors report substantially higher adversarial robustness than previous methods.
  • Closing this distribution gap is crucial for the safe deployment of LLMs in sensitive applications.

Computer Science > Machine Learning
arXiv:2602.15238 (cs) [Submitted on 16 Feb 2026]

Title: Closing the Distribution Gap in Adversarial Training for LLMs
Authors: Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

Abstract: Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.

Subjects: Machine Learning (cs.LG); Artificial In...
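The abstract's "continuous adversarial training" ingredient refers to perturbing inputs in a continuous space (e.g. token embeddings) rather than searching over discrete tokens. The paper itself applies this to LLM embeddings with a diffusion model supplying diverse prompts; the sketch below illustrates only the continuous part on a toy logistic head, with all names and the PGD-style inner loop being this summary's assumptions, not the authors' implementation.

```python
import numpy as np

def continuous_adv_perturbation(emb, w, label, eps=0.1, steps=5, lr=0.05):
    """PGD-style perturbation of input embeddings within an L_inf ball.

    emb:   (tokens, dim) input embeddings
    w:     (dim,) weights of a toy logistic head p = sigmoid(w . mean(emb))
    label: target in {0, 1}; the loop ASCENDS the BCE loss to find a
           worst-case perturbation, then the model would be trained on it.
    """
    delta = np.zeros_like(emb)
    for _ in range(steps):
        z = w @ (emb + delta).mean(axis=0)
        p = 1.0 / (1.0 + np.exp(-z))
        # gradient of the BCE loss w.r.t. each token's perturbation:
        # dL/dz = p - label, dz/ddelta_t = w / num_tokens
        grad = (p - label) * w / emb.shape[0]
        delta = delta + lr * np.sign(grad)   # signed ascent step
        delta = np.clip(delta, -eps, eps)    # project back into the ball
    return delta

# hypothetical usage on random embeddings
rng = np.random.default_rng(0)
emb, w = rng.normal(size=(4, 8)), rng.normal(size=8)
delta = continuous_adv_perturbation(emb, w, label=1.0)
```

In DAT, such inner-loop perturbations would be applied not to a fixed training set but to samples drawn from the diffusion model's approximation of the prompt-response distribution, which is what broadens coverage beyond the original data.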
