[2602.14989] ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
Summary
ThermEval introduces a benchmark for evaluating vision-language models on thermal imagery, highlighting their limitations in temperature-grounded reasoning and the need for specialized assessment tools.
Why It Matters
As thermal imaging becomes increasingly important in fields like surveillance, search and rescue, and medical screening, understanding how vision-language models perform on thermal data is crucial. This benchmark provides a rigorous way to measure and track model capability in scenarios where RGB imagery fails, supporting progress in these critical applications.
Key Takeaways
- The ThermEval-B benchmark comprises approximately 55,000 thermal visual question-answer pairs.
- Existing models struggle with temperature-grounded reasoning and colormap transformations.
- ThermEval-D dataset provides dense per-pixel temperature maps with semantic annotations.
- Current evaluation methods are inadequate for thermal imagery, necessitating dedicated benchmarks.
- Results indicate minimal improvements from prompting or supervised fine-tuning.
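A dense per-pixel temperature map with semantic annotations, as described for ThermEval-D, can be pictured as a grid of temperature values paired with a label mask. The sketch below is purely illustrative: the array values, the label IDs, and the `region_stats` helper are hypothetical and not the dataset's actual format or API. It shows the kind of temperature-grounded question (mean or maximum temperature of a semantic region) that such annotations make answerable.

```python
# Hypothetical per-pixel temperature map in degrees Celsius (not real data),
# paired with a semantic mask in the spirit of ThermEval-D's annotations.
temps = [[36.5, 36.8, 22.0],
         [36.9, 37.1, 21.5],
         [21.8, 21.9, 21.7]]
labels = [[1, 1, 0],          # 1 = a body-part region (illustrative label ID)
          [1, 1, 0],
          [0, 0, 0]]          # 0 = background

def region_stats(temps, labels, region_id):
    """Mean and max temperature over the pixels of one semantic region."""
    vals = [t for trow, lrow in zip(temps, labels)
            for t, l in zip(trow, lrow) if l == region_id]
    return sum(vals) / len(vals), max(vals)

mean_t, max_t = region_stats(temps, labels, 1)
# mean_t == 36.825, max_t == 37.1
```

Grounding answers in such per-pixel values is what distinguishes thermal VQA from RGB VQA, where color and texture rather than physical quantities carry the signal.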
arXiv:2602.14989 (cs) [Submitted on 16 Feb 2026]
Computer Science > Computer Vision and Pattern Recognition
Title: ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
Authors: Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra
Abstract: Vision-language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently...
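The difficulty models have with colormap transformations can be made concrete with a toy sketch: the same temperature reads as a completely different pixel color depending on the rendering colormap, so color alone never determines temperature. The two colormap functions below are simplified illustrations of my own, not the benchmark's or matplotlib's actual colormaps.

```python
def normalize(temps, t_min, t_max):
    """Map raw temperatures onto [0, 1] for colormap rendering."""
    return [(t - t_min) / (t_max - t_min) for t in temps]

def gray(v):
    """Grayscale colormap: the same intensity on all three channels."""
    g = round(255 * v)
    return (g, g, g)

def hot(v):
    """Toy 'hot'-style ramp (illustrative only): black -> red -> yellow."""
    r = round(255 * min(1.0, 2 * v))
    g = round(255 * max(0.0, 2 * v - 1))
    return (r, g, 0)

temps = [20.0, 30.0, 40.0]          # hypothetical scene temperatures in Celsius
norm = normalize(temps, 20.0, 40.0)
gray_px = [gray(v) for v in norm]   # 30 C renders as mid-gray (128, 128, 128)
hot_px = [hot(v) for v in norm]     # the same 30 C renders as pure red (255, 0, 0)
```

A model that answers from pixel color without recovering the underlying temperature mapping will give inconsistent answers across renderings of the identical thermal scene, which is exactly the failure mode the benchmark probes.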