[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance
Hello everyone. I trained Qwen2.5-1.5b-Instruct with both RLVR and SFT on the GSM8K dataset and compared the results across the GSM8K and MATH benchmarks.

For those unfamiliar:

- SFT (Supervised Fine-tuning): standard next-token prediction training on labeled data.
- RLVR (Reinforcement Learning with Verifiable Rewards): the training approach behind DeepSeek-R1. The model is reinforced to produce responses that earn higher rewards from a verifiable signal (e.g., correct math answers). This is what en...