[2602.13851] Evaluating LLM-Generated ACSL Annotations for Formal Verification
Summary
This paper evaluates the effectiveness of LLM-generated ACSL annotations for formal verification in C programs, comparing multiple generation systems and their performance in automated settings.
Why It Matters
As software systems become increasingly complex, ensuring their reliability through formal specifications is vital. This study sheds light on the capabilities and limitations of automated annotation generation, which can streamline verification processes and improve software dependability.
Key Takeaways
- The study compares five different systems for generating ACSL annotations.
- Automated generation of formal specifications can enhance software verification processes.
- Results highlight both the strengths and weaknesses of LLMs in generating accurate annotations.
Computer Science > Software Engineering arXiv:2602.13851 (cs) [Submitted on 14 Feb 2026] Title:Evaluating LLM-Generated ACSL Annotations for Formal Verification Authors:Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan View a PDF of the paper titled Evaluating LLM-Generated ACSL Annotations for Formal Verification, by Arshad Beg and Diarmuid O'Donoghue and Rosemary Monahan View PDF HTML (experimental) Abstract:Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-ba...