vLLM V0 to V1: Correctness Before Corrections in RL
A blog post by Rafael Pardinas (rafapi-snow) and Ehsan Kamalloo (ehsk), ServiceNow-AI, published on Hugging Face, May 6, 2026.
PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns token logprobs; the trainer uses those logprobs to compute policy ratios, KL, clip rate, entropy, and reward. Any discrepancy in how those logprobs are computed can change the training dynamics. This is the train-inference mismatch we needed to eliminate during the vLLM V0 to V1 migration (see the sketch below for how these metrics are derived from the logprobs).

TL;DR. vLLM V1 matched our vLLM V0 reference after we fixed four things: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head used for the final projection. We fixed the backend behavior before changing the RL objective. The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1.

Figure 1 shows the final result: the red run is the initial V1 attempt, and the green run is the final V1 run after the fixes described below.

Figure 1. Trainer-side metrics for the vLLM V0 reference (blue), the initial vLLM V1 attempt (red), and the final vLLM V1 run after our fixes (green), including the fp32 lm_head. The final V1 run returns close to the V0 trajectory across clip rate, KL, entropy, and reward.

Migration Objective

vLLM V1 is a substantial rewrite of the V0 engine. Our migration target was therefore deliberately...
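To make the mismatch concrete, here is a minimal sketch of how a PPO-style trainer turns per-token logprobs into most of the metrics above (entropy is omitted, since it needs the full token distribution rather than sampled-token logprobs). This is illustrative, not PipelineRL's actual trainer code; the function name, the clip threshold, and the choice of the k3 KL estimator are assumptions.

```python
import torch

def trainer_metrics(trainer_logprobs: torch.Tensor,
                    rollout_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> dict:
    """Hypothetical per-token PPO metrics from trainer vs. rollout logprobs."""
    # Per-token policy ratio pi_trainer / pi_rollout. If the inference
    # engine and the trainer agree on the logprobs, this is exactly 1.
    log_ratio = trainer_logprobs - rollout_logprobs
    ratio = torch.exp(log_ratio)

    # Fraction of tokens where the PPO clip is active.
    clip_rate = ((ratio < 1.0 - clip_eps) | (ratio > 1.0 + clip_eps)).float().mean()

    # k3 estimator of KL(pi_rollout || pi_trainer).
    approx_kl = (ratio - 1.0 - log_ratio).mean()

    # Clipped surrogate loss (PPO).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    return {"clip_rate": clip_rate, "approx_kl": approx_kl, "loss": loss}
```

With identical logprobs, the ratio is exactly 1, approx_kl is 0, and clip_rate is 0; a systematic gap in how the engine computes logprobs shifts all three before any genuine policy change has occurred.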
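The fp32 lm_head fix mentioned in the TL;DR amounts to running the final vocabulary projection, and the log-softmax that produces logprobs, in fp32 rather than in the model's bf16 compute dtype. The following is a minimal sketch of that idea, not vLLM's actual implementation; the function name and tensor shapes are assumptions.

```python
import torch

def fp32_token_logprobs(hidden: torch.Tensor,          # [seq, hidden_dim], e.g. bf16
                        lm_head_weight: torch.Tensor,  # [vocab, hidden_dim]
                        token_ids: torch.Tensor        # [seq], sampled token ids
                        ) -> torch.Tensor:
    # Upcast before the final projection so the matmul and the log-softmax
    # run in fp32, narrowing the numerical gap between trainer-side and
    # inference-side logprobs.
    logits = hidden.float() @ lm_head_weight.float().t()
    logprobs = torch.log_softmax(logits, dim=-1)
    # Gather the logprob of each sampled token: [seq, vocab] -> [seq].
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
```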