[2602.21368] Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Summary
This paper presents a method for certifying the reliability of black-box AI systems. It combines self-consistency sampling with conformal calibration to produce a reliability level: a single number per system-task pair that can serve as a deployment gate.
Why It Matters
As AI systems become increasingly integrated into critical applications, ensuring their reliability is paramount. This research offers a framework that quantifies trust in AI outputs, which is essential for practitioners in various fields relying on AI decision-making.
Key Takeaways
- Introduces a reliability certification method for black-box AI systems.
- Uses self-consistency sampling, which reduces uncertainty exponentially.
- Conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors; harder questions surface as larger answer sets.
- Validated on five benchmarks and five models from three families, using both synthetic and real data.
- Sequential stopping reduces API costs by around 50%.
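To make the sampling and stopping ideas concrete, here is a minimal Python sketch of self-consistency voting with an early-stopping rule. It is not the paper's algorithm: the `sample` callable, the `lead_to_stop` margin rule, and all default values are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[], str],
    max_samples: int = 20,
    min_samples: int = 3,
    lead_to_stop: int = 3,
) -> Tuple[str, float, int]:
    """Query a black-box system repeatedly and majority-vote the answers.

    `sample` is a hypothetical zero-argument callable returning one answer.
    Stopping rule (an assumption, not taken from the paper): stop as soon as
    the leading answer is ahead of the runner-up by `lead_to_stop` votes.
    Returns (majority answer, its vote share, number of calls made).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts: Counter = Counter()
    used = 0
    for used in range(1, max_samples + 1):
        counts[sample()] += 1
        if used < min_samples:
            continue
        ranked = counts.most_common(2)
        top = ranked[0][1]
        second = ranked[1][1] if len(ranked) > 1 else 0
        if top - second >= lead_to_stop:
            break  # the vote is already decisive; skip the remaining calls
    answer, votes = counts.most_common(1)[0]
    return answer, votes / used, used


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Stand-in for an API call: answers "42" three times out of four.
    print(self_consistent_answer(lambda: rng.choice(["42", "42", "42", "41"])))
```

When the system answers consistently, the loop ends after a handful of calls, which is the kind of saving sequential stopping is meant to capture; ambiguous items use the full budget.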
Computer Science > Machine Learning
arXiv:2602.21368 (cs)
[Submitted on 24 Feb 2026]
Title: Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Authors: Charafeddine Mouzouni
Abstract: Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors -- made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
Subjects: Machine Lea...
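The 1/(n+1) term in the abstract matches the standard split-conformal result: with n exchangeable calibration scores, taking the ceil((n+1)(1-alpha))-th smallest score as the threshold gives coverage of at least 1-alpha, and at most 1-alpha+1/(n+1) when scores have no ties. The Python sketch below shows that generic construction only. The nonconformity score used here (one minus an answer's vote share) and both function names are assumptions, and the paper's own definitions may differ.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold from n calibration nonconformity scores.

    Each calibration score is assumed to be 1 - (vote share of the correct
    answer) on one calibration item. Returns the k-th smallest score with
    k = ceil((n + 1) * (1 - alpha)), or +inf when n is too small for alpha.
    """
    n = len(cal_scores)
    # The small epsilon guards against floating-point overshoot before ceil.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def answer_set(vote_shares: Dict[str, float], threshold: float) -> List[str]:
    """Keep every candidate answer whose score 1 - share is within threshold."""
    return sorted(a for a, share in vote_shares.items() if 1.0 - share <= threshold)


if __name__ == "__main__":
    cal = [0.0, 0.125, 0.25, 0.25, 0.5, 0.5, 0.75]  # made-up calibration scores
    q = conformal_threshold(cal, alpha=0.125)
    print(answer_set({"A": 0.875, "B": 0.125}, q))           # confident item
    print(answer_set({"A": 0.5, "B": 0.25, "C": 0.25}, q))   # ambiguous item
```

Under this choice of score, an item where votes are spread across several answers yields a larger set than one with a clear majority, which is one way the "larger answer sets for harder questions" behavior can arise.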