[2602.18520] Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams
Summary
The paper presents Sketch2Feedback, a framework that improves feedback on student-drawn STEM diagrams by integrating grammar-based rule checking to reduce hallucination in large multimodal models.
Why It Matters
This research addresses a critical challenge in STEM education by improving the reliability of AI-generated feedback on student diagrams. By minimizing hallucination while still producing actionable feedback, the framework can strengthen both learning outcomes and teachers' trust in AI tools deployed in the classroom.
Key Takeaways
- Sketch2Feedback employs a grammar-in-the-loop approach to enhance feedback accuracy.
- The framework reduces hallucination rates in AI feedback, improving trustworthiness.
- Evaluation shows that the grammar-based method provides more actionable insights than traditional end-to-end models.
Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.18520 (cs) [Submitted on 19 Feb 2026]

Title: Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Authors: Aayam Bansal

Abstract: Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) a...
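The grammar-in-the-loop idea in the abstract -- a rule engine verifies rubric violations on a symbolic graph, and the language model verbalizes only those verified violations -- can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Force` data model, the two rubric rules, and the template standing in for the constrained VLM are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical symbolic representation of a free-body diagram: a list of
# labeled force arrows. The field names and angle convention (degrees,
# 270 = straight down) are illustrative assumptions, not the paper's schema.
@dataclass
class Force:
    name: str
    angle_deg: float  # direction the arrow points in the student's diagram

# Stage 3 (constraint checking): each rule inspects the symbolic graph and
# returns a violation message, or None when the rule is satisfied.
def check_gravity_points_down(forces):
    g = next((f for f in forces if f.name == "gravity"), None)
    if g is None:
        return "Missing force: gravity is not drawn."
    if abs(g.angle_deg - 270) > 15:  # illustrative tolerance
        return (f"Gravity is drawn at {g.angle_deg:g} deg; "
                "it should point straight down (270 deg).")
    return None

def check_normal_force_present(forces):
    if not any(f.name == "normal" for f in forces):
        return "Missing force: the normal force from the surface is not drawn."
    return None

RULES = [check_gravity_points_down, check_normal_force_present]

# Stage 4 (constrained feedback): the VLM would verbalize only violations
# verified upstream; here a plain template stands in for the model.
def feedback(forces):
    violations = [v for rule in RULES if (v := rule(forces)) is not None]
    if not violations:
        return "All rubric checks passed."
    return "Issues found:\n" + "\n".join(f"- {v}" for v in violations)

# A diagram with tilted gravity and no normal force triggers both rules.
student_diagram = [Force("gravity", 250.0)]
print(feedback(student_diagram))
```

Because the feedback stage only ever sees rule-verified violations, the language model cannot introduce rubric claims of its own, which is the mechanism the abstract credits for reduced hallucination.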