[2602.22570] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
Summary
The paper examines evaluation pitfalls in text-to-image generation, focusing on classifier-free guidance (CFG), and proposes a guidance-aware evaluation framework to correct biases in current preference-based metrics.
Why It Matters
This research highlights critical flaws in the evaluation of text-to-image generation models, which can lead to misleading results. By proposing a new evaluation framework, it aims to improve the reliability of model assessments and guide future research directions in generative AI.
Key Takeaways
- Current evaluation methods for text-to-image generation exhibit biases that can misrepresent model performance.
- The proposed GA-Eval framework offers a more accurate assessment of guidance methods in generative models.
- Simply increasing the CFG scale can inflate quantitative scores even as image quality degrades (e.g., oversaturation and artifacts).
- The study empirically evaluates eight diffusion guidance methods, showing how their apparent gains hold up under guidance-calibrated comparison.
- The findings encourage a reevaluation of existing paradigms in the generative AI community.
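The pitfall in the takeaways above can be illustrated with a toy sketch: an alignment-style score that rewards how far guidance pushes an image toward the prompt will rise monotonically with the guidance scale, even while more pixel values are driven out of the valid range. All names and values here are illustrative stand-ins, not the paper's metrics or data.

```python
import numpy as np

def guided_pixels(scale):
    # Toy "image": guidance shifts pixel values toward the prompt
    # direction; valid pixel values are meant to live in [0, 1].
    base = np.linspace(0.2, 0.8, 5)
    return base + 0.1 * scale  # stronger guidance pushes pixels further

def alignment_score(pixels):
    # Toy stand-in for a preference/alignment metric: it only rewards
    # movement toward the prompt direction, ignoring image quality.
    return pixels.mean()

def clipped_fraction(pixels):
    # Fraction of pixels pushed past the valid range (oversaturation).
    return float(np.mean(pixels > 1.0))

for scale in (1.0, 3.0, 7.5):
    px = guided_pixels(scale)
    # The alignment score keeps rising with scale, while the clipped
    # fraction (a quality defect) rises too -- the score is misleading.
    print(f"scale={scale}: align={alignment_score(px):.2f}, "
          f"clipped={clipped_fraction(px):.2f}")
```

The sketch shows why a scale-biased metric cannot distinguish a genuinely better guidance method from one that merely behaves like CFG at a larger scale, which is the comparison GA-Eval is designed to make fair.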
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.22570 (cs)
Submitted on 26 Feb 2026
Title: Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
Authors: Dian Xie, Shitong Shao, Lichen Bai, Zikai Zhou, Bojun Cheng, Shuo Yang, Jun Wu, Zeke Xie
Abstract: Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall that common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to ...
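The CFG mechanism the abstract refers to is commonly written as an extrapolation from the unconditional noise prediction toward the conditional one. A minimal sketch with toy NumPy arrays (the array values and function name are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one by `scale`.
    scale = 1.0 recovers the purely conditional prediction; larger
    scales push further, strengthening semantic alignment at the
    risk of oversaturation and artifacts."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy noise predictions standing in for a diffusion model's outputs.
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)

for scale in (1.0, 3.0, 7.5):
    guided = cfg_combine(eps_uncond, eps_cond, scale)
    print(f"scale={scale}: guided[0]={guided[0]}")
```

Because the guided prediction moves linearly with the scale, any evaluation metric biased toward strong semantic alignment can be gamed simply by turning this single knob up, which is the pitfall the paper identifies.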