[2511.18721] Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
Summary
This paper introduces a probabilistic framework for certifying defenses against jailbreaking attacks on LLMs, addressing limitations of the existing SmoothLLM defense method.
Why It Matters
As LLMs are deployed in more applications, certified defenses against jailbreaking become essential. SmoothLLM's existing certificate rests on a "k-unstable" assumption that rarely holds in practice; this work replaces it with a more realistic probabilistic certification, yielding safety guarantees practitioners can actually trust when deploying LLMs.
Key Takeaways
- Introduces the (k, ε)-unstable framework for better safety guarantees.
- Addresses limitations of the existing SmoothLLM defense against jailbreaking.
- Provides actionable thresholds for practitioners to enhance LLM safety.
- Incorporates empirical models to improve the reliability of safety certificates.
- Contributes to the broader challenge of secure AI deployment.
Computer Science > Machine Learning
arXiv:2511.18721 (cs)
[Submitted on 24 Nov 2025 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
Authors: Adarsh Kumarappan, Ayushi Mehrotra
Abstract: The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, $\varepsilon$)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety...
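To make the certification idea concrete, the following is a minimal sketch of a SmoothLLM-style defense probability calculation. It assumes the standard k-unstable binomial model from the original SmoothLLM defense (each character of an adversarial suffix of length m is independently perturbed with probability q, and the attack fails on a copy if at least k characters change, with a majority vote over N perturbed copies). The (k, ε) relaxation is modeled here simply by discounting each copy's defense probability by (1 − ε); this discount is an illustrative assumption, not the paper's exact lower bound.

```python
from math import comb

def p_at_least_k_changed(m: int, k: int, q: float) -> float:
    """P(at least k of m suffix characters are perturbed), Binomial(m, q) tail."""
    return sum(comb(m, i) * q**i * (1 - q)**(m - i) for i in range(k, m + 1))

def smoothllm_dsp(N: int, m: int, k: int, q: float, eps: float = 0.0) -> float:
    """Defense success probability under a majority vote over N perturbed copies.

    eps = 0 recovers the strict k-unstable model; eps > 0 is an illustrative
    (k, eps) discount, assuming the attack still succeeds with probability eps
    even when >= k characters are perturbed.
    """
    # Per-copy defense probability.
    alpha = p_at_least_k_changed(m, k, q) * (1 - eps)
    # Defense succeeds overall if more than N/2 copies individually defend.
    threshold = N // 2 + 1
    return sum(comb(N, j) * alpha**j * (1 - alpha)**(N - j)
               for j in range(threshold, N + 1))
```

For example, `smoothllm_dsp(N=5, m=20, k=2, q=0.1)` gives the certified defense probability for a 20-character suffix under a 10% per-character perturbation rate; raising `eps` lowers the certificate, reflecting the weaker but more realistic assumption.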