[2602.12506] On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs
Summary
This article examines the robustness and chain-of-thought consistency of reinforcement learning (RL) fine-tuned vision language models (VLMs), highlighting their vulnerabilities and the impact of training methods on model reliability.
Why It Matters
As AI models become integral to reasoning tasks, understanding their limitations is crucial for developing more reliable systems. This research sheds light on the trade-offs between accuracy and robustness, emphasizing the need for improved training protocols that ensure both performance and faithfulness in model outputs.
Key Takeaways
- RL fine-tuning enhances VLMs but introduces vulnerabilities.
- Textual perturbations significantly affect model robustness and confidence.
- Accuracy gains from fine-tuning can come at the cost of reliability and faithfulness.
- Adversarial augmentation alone does not guarantee robustness.
- Faithfulness-aware rewards can help align reasoning with outputs.
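The paper's exact reward formulation is not reproduced here. As a hypothetical illustration of the last takeaway, a faithfulness-aware reward could add a consistency bonus, so that a correct final answer only earns full credit when the chain-of-thought actually supports it. All names and the linear combination below are illustrative assumptions, not the authors' method:

```python
def faithfulness_aware_reward(final_answer, gold_answer, cot_answer, lam=0.5):
    """Hypothetical scalar reward: task accuracy plus a consistency bonus.

    cot_answer is the answer implied by the chain-of-thought (e.g. parsed
    from the reasoning trace); lam weights how strongly agreement between
    the reasoning and the final answer is rewarded. Illustrative only.
    """
    accuracy = 1.0 if final_answer == gold_answer else 0.0
    consistency = 1.0 if cot_answer == final_answer else 0.0
    return accuracy + lam * consistency

# A correct answer backed by a consistent trace earns the full reward ...
print(faithfulness_aware_reward("B", "B", "B"))   # 1.5
# ... while a correct answer contradicted by its own trace earns less.
print(faithfulness_aware_reward("B", "B", "C"))   # 1.0
```

Under a reward like this, the RL objective no longer pays out for answers the model reaches despite, rather than through, its stated reasoning.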
Computer Science > Machine Learning
arXiv:2602.12506 (cs) [Submitted on 13 Feb 2026]
Title: On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs
Authors: Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal
Abstract
Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations--misleading captions or incorrect chain-of-thought (CoT) traces--cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and probability mass on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the re...
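The entropy-based analysis mentioned in the abstract can be sketched as follows. Assuming access to per-option logits for a multiple-choice question, one can compare the Shannon entropy of the answer distribution and the probability mass on the correct option before and after a textual perturbation. The function and the example logits below are illustrative assumptions, not the paper's measurements:

```python
import math

def option_entropy(logits):
    """Shannon entropy (nats) of the softmax over answer-option logits,
    plus the normalized probabilities themselves."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0), probs

# Clean prompt: the model is confident in the correct option B (index 1).
clean_logits = [1.0, 4.0, 0.5, 0.2]
# Same question with a misleading caption: probability mass spreads out.
perturbed_logits = [2.0, 2.5, 1.8, 1.6]

h_clean, p_clean = option_entropy(clean_logits)
h_pert, p_pert = option_entropy(perturbed_logits)

print(f"clean:     entropy={h_clean:.3f}  mass on correct={p_clean[1]:.3f}")
print(f"perturbed: entropy={h_pert:.3f}  mass on correct={p_pert[1]:.3f}")
```

A perturbation that raises entropy while draining probability mass from the correct option is exactly the kind of miscalibration signal the abstract describes.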