[2502.17160] A Pragmatic Note on Evaluating Generative Models with Fréchet Inception Distance for Retinal Image Synthesis
Summary
This article discusses the limitations of using Fréchet Inception Distance (FID) as an evaluation metric for generative models in retinal image synthesis, emphasizing the need for task-specific evaluations.
Why It Matters
Understanding the limitations of FID in biomedical contexts is crucial for evaluating generative models properly. In biomedical imaging, synthetic images are often generated to enrich training datasets, so the paper argues that a model should be judged by adding its synthetic data to downstream task training, such as classification and segmentation, rather than by FID alone.
Key Takeaways
- FID is commonly used but may not align with specific biomedical tasks.
- Task-specific evaluations are essential for assessing generative model performance.
- The paper examines retinal imaging modalities to illustrate FID's limitations.
- Incorporating synthetic data into downstream task training, such as classification or segmentation, gives a more pragmatic evaluation.
- Awareness of these limitations can guide future research in generative models.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2502.17160 (cs)
[Submitted on 24 Feb 2025 (v1), last revised 20 Feb 2026 (this version, v3)]
Title: A Pragmatic Note on Evaluating Generative Models with Fréchet Inception Distance for Retinal Image Synthesis
Authors: Yuli Wu, Fucheng Liu, Rüveyda Yilmaz, Henning Konermann, Peter Walter, Johannes Stegmaier
Abstract: Fréchet Inception Distance (FID), computed with an ImageNet pretrained Inception-v3 network, is widely used as a state-of-the-art evaluation metric for generative models. It assumes that feature vectors from Inception-v3 follow a multivariate Gaussian distribution and calculates the 2-Wasserstein distance based on their means and covariances. While FID effectively measures how closely synthetic data match real data in many image synthesis tasks, the primary goal in biomedical generative models is often to enrich training datasets ideally with corresponding annotations. For this purpose, the gold standard for evaluating generative models is to incorporate synthetic data into downstream task training, such as classification and segmentation, to pragmatically assess its performance. In this paper, we examine cases from retinal...
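The distance described in the abstract can be sketched in a few lines. For two Gaussians with means mu_r, mu_s and covariances C_r, C_s, the squared 2-Wasserstein distance is ||mu_r - mu_s||^2 + Tr(C_r) + Tr(C_s) - 2 Tr((C_r C_s)^(1/2)). The following is a minimal illustration under stated assumptions, not the paper's code: the function name `frechet_distance` is hypothetical, feature extraction with Inception-v3 is omitted, and random arrays stand in for the feature vectors.

```python
import numpy as np


def frechet_distance(feats_real, feats_synth):
    """Squared 2-Wasserstein distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_features). In a real FID
    computation these would be Inception-v3 features; here they are arbitrary.
    """
    mu_r = feats_real.mean(axis=0)
    mu_s = feats_synth.mean(axis=0)
    # atleast_2d keeps the code valid when there is a single feature dimension.
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_s = np.atleast_2d(np.cov(feats_synth, rowvar=False))

    # Tr((cov_r cov_s)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative for
    # positive semi-definite covariances. Clipping removes tiny negative
    # values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    # Placeholder features (assumption): random vectors, not real image features.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(1000, 8))
    synth = rng.normal(loc=0.5, size=(1000, 8))
    print("distance(real, real): ", frechet_distance(real, real))
    print("distance(real, synth):", frechet_distance(real, synth))
```

The eigenvalue route is used here only to keep the sketch NumPy-only; common implementations compute a matrix square root instead. Either way, the score depends solely on the means and covariances of the features, which is the property the abstract contrasts with downstream-task evaluation.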