Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]
So, yesterday's run was a success and I got an avg rollout length of about 64 tokens, as shown in the attached image! This was with quality_reward + length_penalty (more info below!). Next, I'll run with length_penalty alone as the reward, and with the earlier mistake of counting characters as tokens fixed, and see if there's any gaming-the-system behavior or degraded outputs! The two rewards I used were:

- length_penalty: basically -abs(response_length - MAX_LENGTH)
- quality_reward: ROUGE-L, which is basically LCS-based (longest common subsequence)
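For anyone curious, here's a minimal sketch of those two rewards. The MAX_LENGTH value and whitespace tokenization are my assumptions for illustration, not the exact setup from the run (a real ROUGE-L implementation would use proper tokenization and stemming):

```python
MAX_LENGTH = 64  # assumed target rollout length in tokens


def length_penalty(response_length: int, max_length: int = MAX_LENGTH) -> float:
    """Penalize deviation from the target length: -|length - target|."""
    return -abs(response_length - max_length)


def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A perfect-length rollout gets length_penalty(64) == 0, while e.g. a 70-token one gets -6, so the gradient always points back toward the target length.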