[2510.19060] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Summary
The paper introduces PoSh, a metric that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge in evaluating detailed image descriptions, outperforming existing metrics.
Why It Matters
As vision-language models produce increasingly detailed image descriptions, accurate evaluation metrics are crucial for assessing that output. PoSh addresses the limitations of current metrics, which were designed for short captions, by providing a more fine-grained approach that correlates better with human judgment.
Key Takeaways
- PoSh uses scene graphs to improve evaluation of detailed image descriptions.
- It offers better correlation with human judgments compared to existing metrics.
- The new DOCENT dataset provides a challenging benchmark for evaluating image descriptions.
arXiv:2510.19060 (cs) — Computer Science > Computer Vision and Pattern Recognition
Submitted on 21 Oct 2025 (v1), last revised 26 Feb 2026 (this version, v3)
Title: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork...
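To make the rubric idea concrete, here is a minimal, hypothetical sketch of scoring a description against a scene graph of (subject, relation, object) triples. The actual PoSh metric prompts an LLM judge to verify each rubric item and localize errors to text spans; this toy version substitutes naive keyword matching purely to illustrate the structure, and all names (`Triple`, `rubric_score`) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge: e.g. ('woman', 'holding', 'umbrella')."""
    subject: str
    relation: str
    obj: str


def triple_covered(triple: Triple, description: str) -> bool:
    # Toy stand-in for the LLM judge: a triple counts as covered
    # if all three parts appear verbatim in the description.
    text = description.lower()
    return all(
        part.lower() in text
        for part in (triple.subject, triple.relation, triple.obj)
    )


def rubric_score(scene_graph: list[Triple], description: str) -> float:
    """Fraction of scene-graph triples the description covers.
    PoSh instead aggregates judged, span-localized errors."""
    if not scene_graph:
        return 0.0
    hits = sum(triple_covered(t, description) for t in scene_graph)
    return hits / len(scene_graph)


graph = [
    Triple("woman", "holding", "umbrella"),
    Triple("umbrella", "above", "dog"),
]
full = rubric_score(graph, "A woman holding a red umbrella above a small dog.")
partial = rubric_score(graph, "A woman holding an umbrella.")
print(full, partial)  # full covers both triples, partial only one
```

The key design point this sketch keeps from the paper is that the score is grounded in per-item checks, so a low score can be traced back to the specific relations or attributes the description got wrong.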