# Vision Language Model Alignment in TRL ⚡️
*Published August 7, 2025*

Authors: Sergio Paniego (sergiopaniego), merve, Quentin Gallouédec (qgallouedec), Kashif Rasul (kashif), Aritra Roy Gosthipaty (ariG23498)

## Introduction

Vision Language Models (VLMs) are getting stronger, but aligning them to human preferences still matters. In TRL, we already showed how to post-train VLMs with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This time, we’re going further.

## tl;dr

Here’s what’s new in TRL:

- Mixed Preference Optimization (MPO)
- Group Relative Policy Optimization (GRPO)
- Group Sequence Policy Optimization (GSPO), a variant of GRPO

These go beyond pairwise DPO, extracting richer signals from preference data and scaling better with modern VLMs.

We’ve also extended existing methods to support VLMs:

- Reinforce Leave One Out (RLOO)
- Online Direct Preference Optimization (Online DPO)

This enables more efficient and scalable multimodal alignment.

Finally:

- Native Supervised Fine-Tuning support for Vision Language Models
- Training scripts and demo notebooks to help you get started quickly

## Table of Contents

- Multimodal Alignment for VLMs in TRL ⚡️
- Introduction
- Alignment for Vision Language Models
- Mixed Preference Optimization (MPO)
- Multimodal Group Relative Policy Optimization (GRPO)
- Group Sequence Policy Optimization (GSPO)
- Comparison
- Further Extensions for VLMs
- Reinforce Leave One Out (RLOO)
- Online Direct Preference Optimization (Online DPO)
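As a refresher on the pairwise signal that DPO extracts, and that MPO and the online methods build on, here is a minimal sketch of the standard DPO loss computed from per-sequence log-probabilities. This is plain Python with the `math` module, not TRL's API; the function and argument names are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model (for a VLM, the response tokens are
    conditioned on the image as well as the text prompt).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), i.e. softplus(-x)
    return math.log1p(math.exp(-logits))

# Policy favors the chosen response more than the reference does, so the
# loss drops below log(2), the value at zero margin.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

The loss only depends on how much more the policy prefers the chosen response over the rejected one *relative to the reference model*; MPO mixes this pairwise term with additional quality and generation losses, which the dedicated section below covers.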