[2602.13376] An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation
Summary
This article presents a novel reference-free evaluation framework for assessing the quality of flowchart image-to-code generation, utilizing automated metrics for continuous quality monitoring.
Why It Matters
As flowchart image-to-code generation becomes more prevalent in document processing, ensuring output quality without reference codes is crucial. This framework offers a practical solution for real-time evaluation, enhancing reliability in production environments.
Key Takeaways
- Introduces a reference-free evaluation framework for flowchart image-to-code generation.
- Employs two automated metrics, Recall_OCR and Precision_VE, for quality assessment.
- Demonstrates strong correlation with ground-truth metrics, validating its reliability.
- Provides a unified quality score (F1_OCR-VE) for continuous monitoring.
- Addresses the challenge of quality evaluation in production settings without ground-truth references.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13376 (cs) [Submitted on 13 Feb 2026]
Title: An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation
Authors: Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang
Abstract: Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1...
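The scoring logic described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper functions, token sets, and entailment scores below are hypothetical stand-ins for real OCR output and a real Visual Entailment model; only the metric definitions (coverage recall, entailment-based precision, and their harmonic mean) follow the paper.

```python
def recall_ocr(ocr_tokens, generated_tokens):
    """Recall_OCR: fraction of OCR-extracted text (the proxy reference)
    that appears in the generated code. Token sets are hypothetical."""
    if not ocr_tokens:
        return 0.0
    covered = sum(1 for t in ocr_tokens if t in generated_tokens)
    return covered / len(ocr_tokens)

def precision_ve(entailment_scores, threshold=0.5):
    """Precision_VE: fraction of generated elements a Visual Entailment
    model judges as supported by the input image (i.e., not hallucinated).
    Scores and threshold here are placeholders."""
    if not entailment_scores:
        return 0.0
    supported = sum(1 for s in entailment_scores if s >= threshold)
    return supported / len(entailment_scores)

def f1_ocr_ve(recall, precision):
    """F1_OCR-VE: harmonic mean of the two metrics, the unified score."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Toy example: 2 of 3 OCR tokens covered, 2 of 3 elements entailed.
r = recall_ocr({"start", "check", "end"}, {"start", "end", "stop"})
p = precision_ve([0.9, 0.8, 0.3])
print(round(r, 3), round(p, 3), round(f1_ocr_ve(r, p), 3))
```

The harmonic mean penalizes imbalance between the two metrics, so an output that covers the flowchart text but hallucinates many elements (or vice versa) still receives a low unified score.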