[2602.18492] Vibe Coding on Trial: Operating Characteristics of Unanimous LLM Juries
Summary
The paper evaluates unanimous committees ("juries") of Large Language Models (LLMs) as reviewers of model-generated SQL queries, finding that requiring unanimous agreement among strong models can cut false accepts in safety-first coding workflows.
Why It Matters
As LLMs become integral in coding workflows, ensuring the reliability of their outputs is crucial. This study provides a framework for assessing model performance, which can enhance safety in automated coding processes and reduce errors in software development.
Key Takeaways
- Unanimous committees of LLMs can significantly reduce false accepts in SQL query evaluations.
- The composition of the committee impacts performance, with smaller groups showing better results.
- Benchmarking 15 open models under an execution-grounded protocol provides a clear baseline for selecting strong jury members.
Computer Science > Databases
arXiv:2602.18492 (cs) [Submitted on 12 Feb 2026]
Title: Vibe Coding on Trial: Operating Characteristics of Unanimous LLM Juries
Authors: Muhammad Aziz Ullah, Abdul Serwadda
Abstract: Large Language Models (LLMs) are now good enough at coding that developers can describe intent in plain language and let the tool produce the first code draft, a workflow increasingly built into tools like GitHub Copilot, Cursor, and Replit. What is missing is a reliable way to tell which model-written queries are safe to accept without sending everything to a human. We study the use of an LLM jury to run this review step. We first benchmark 15 open models on 82 MySQL text-to-SQL tasks using an execution-grounded protocol to get a clean baseline of which models are strong. From the six best models we build unanimous committees of sizes 1 through 6 that see the prompt, schema, and candidate SQL, and accept it only when every member says it is correct. This rule matches safety-first deployments where false accepts are more costly than false rejects. We measure true positive rate, false positive rate, and Youden's J, and we also look at committees per generator. Our results show that single-model judges are uneven, and that small unanimous committees of strong models can cut false accepts while still...
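The unanimous-accept rule and the reported operating characteristics are simple to state concretely. Below is a minimal sketch, not the paper's code: the function names, the three-judge setup, and the toy votes are all illustrative assumptions; only the decision rule (accept iff every member votes correct) and the metric definitions (TPR, FPR, Youden's J = TPR − FPR) come from the abstract.

```python
# Illustrative sketch of a unanimous LLM jury and its operating
# characteristics. All names and data here are hypothetical.

def jury_accepts(votes):
    """Unanimous rule: accept a candidate SQL query only if every
    committee member votes that it is correct."""
    return all(votes)

def operating_characteristics(decisions, labels):
    """Compute TPR, FPR, and Youden's J from jury accept decisions
    against ground-truth correctness labels."""
    tp = sum(d and y for d, y in zip(decisions, labels))
    fp = sum(d and not y for d, y in zip(decisions, labels))
    fn = sum((not d) and y for d, y in zip(decisions, labels))
    tn = sum((not d) and (not y) for d, y in zip(decisions, labels))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr, tpr - fpr  # Youden's J = TPR - FPR

# Toy example: a three-judge committee votes on four candidate queries.
votes_per_query = [
    [True, True, True],    # unanimous -> accept
    [True, False, True],   # one dissent -> reject
    [True, True, True],    # unanimous -> accept (but query is wrong)
    [False, False, False], # unanimous reject
]
labels = [True, True, False, False]  # ground-truth correctness
decisions = [jury_accepts(v) for v in votes_per_query]
tpr, fpr, j = operating_characteristics(decisions, labels)
```

The toy numbers show why the rule suits safety-first review: a single dissenting vote is enough to route a query back to a human, trading some true positives for fewer false accepts.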