[2506.13593] Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs


Summary

This paper introduces time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations a large language model (LLM) requires before producing an unsafe output.

Why It Matters

As generative AI models become more prevalent, ensuring their safety is critical. This research offers a new metric for assessing the risk of unsafe outputs, which is essential for developers and researchers aiming to improve AI safety protocols. The proposed method enhances sample efficiency and provides rigorous coverage guarantees, making it a valuable contribution to the field.

Key Takeaways

  • Introduces time-to-unsafe-sampling as a safety measure for LLMs.
  • Frames the estimation problem as survival analysis to handle censoring when no unsafe output is observed within the sampling budget.
  • Proposes a calibration technique, building on conformal prediction, for lower predictive bounds (LPBs) on time-to-unsafe-sampling.
  • Implements an optimized sampling-budget allocation to enhance efficiency.
  • Demonstrates practical utility through experiments on synthetic and real data.

Computer Science > Machine Learning · arXiv:2506.13593 (cs)
Submitted on 16 Jun 2025 (v1); last revised 16 Feb 2026 (this version, v5)

Title: Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
Authors: Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano

Abstract: We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method.
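The quantity the abstract defines can be illustrated with a toy Monte Carlo sketch. This is not the paper's calibration method, only the measurement it operates on: `sample_fn`, `is_unsafe`, and the geometric toy model below are hypothetical stand-ins for an LLM generation call and a safety classifier.

```python
import random

def time_to_unsafe_sampling(sample_fn, is_unsafe, budget):
    """Draw generations until one is unsafe; censor at the budget.

    Returns (t, observed): t is the number of draws taken, and observed
    is True if an unsafe output was seen, False if the observation was
    censored at `budget` (the case survival analysis is needed for).
    """
    for t in range(1, budget + 1):
        if is_unsafe(sample_fn()):
            return t, True
    return budget, False

# Toy model: each generation is unsafe independently with probability p,
# so the true time-to-unsafe-sampling is geometric with parameter p.
random.seed(0)
p = 0.05
t, observed = time_to_unsafe_sampling(
    sample_fn=lambda: random.random(),  # stand-in for an LLM generation
    is_unsafe=lambda x: x < p,          # stand-in for a safety classifier
    budget=200,
)
```

Because well-aligned models make unsafe outputs rare, many prompts exhaust the budget without an observation; the paper's survival-analysis framing and conformal calibration exist precisely to produce valid lower bounds from such censored data.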

