[2602.00381] Modeling Image-Caption Rating from Comparative Judgments
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.00381 (cs)
[Submitted on 30 Jan 2026 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Modeling Image-Caption Rating from Comparative Judgments
Authors: Kezia Minni, Qiang Zhang, Monoshiz Mahbub Khan, Zhe Yu

Abstract: Image-caption rating is becoming increasingly important because computer-generated captions are used extensively for descriptive annotation. However, rating how accurately a caption describes an image is time-consuming and subjective. In contrast, it is often easier for people to compare two image-caption pairs and judge which caption better matches its image. In this study, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The resulting model can then rank unseen image-caption pairs just as a regression model trained on direct ratings would. Inspired by a state-of-the-art regression approach, we extracted visual and text features with a pre-trained ViLBERT model and tuned the learning parameters of the baseline model to improve performance. This new regression model (Kendall's $\tau_c=0.812$) outperformed the baseline model (Kendall's $\tau_c=0.758$) on the VICR dataset. The same model structure was applied to the comparative learning framework. Trained on c...
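The core idea, learning a scorer from pairwise "which pair matches better" judgments rather than absolute ratings, can be sketched with a Bradley-Terry-style logistic loss over a linear scorer. The feature vectors, dimensions, and training loop below are illustrative stand-ins (the paper extracts features with ViLBERT and uses its own model structure), not the authors' implementation:

```python
import math

def train_pairwise(judgments, dim, epochs=200, lr=0.1):
    """Learn a linear scorer w from comparative judgments.

    Each judgment is (x_win, x_lose), where x_win is the feature vector of the
    image-caption pair judged the better match. We minimize the Bradley-Terry
    logistic loss -log sigma(w.x_win - w.x_lose) by gradient descent, so the
    model only ever sees comparisons, never absolute ratings.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x_win, x_lose in judgments:
            diff = [a - b for a, b in zip(x_win, x_lose)]
            s = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-s))   # P(winner beats loser)
            g = 1.0 - p                      # -dLoss/ds for the logistic loss
            for i, di in enumerate(diff):
                w[i] += lr * g * di
    return w

def score(w, x):
    """Score an unseen image-caption pair; higher means a better match."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy 2-D features where the first coordinate loosely encodes caption quality.
judgments = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.4], [0.3, 0.6]),
    ([0.8, 0.2], [0.1, 0.5]),
]
w = train_pairwise(judgments, dim=2)
```

Once trained, `score` induces a ranking over unseen pairs, which is exactly what rank correlations such as Kendall's $\tau_c$ evaluate against human ratings.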