I implemented DPO from the paper and the reward margin hit 599. Here's what that actually means [R]
DPO (Rafailov et al., NeurIPS 2023) is supposed to be the clean alternative to PPO: no reward model in the training loop, no value function, no rollout collection. Just a binary cross-entropy loss over preference pairs. And the math is elegant: the partition function Z(x) cancels out when you substitute the log-ratio reparameterisation into the Bradley-Terry model. I implemented it from scratch as part of a multi-stage RLHF project (same model, same tokenizer, same evaluation suite as my PPO a...
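
For context, the whole loss really is a few lines once you have the log-probs. Here's a minimal sketch of what I mean (my own variable names, not the paper's; it assumes the summed per-token log-probs under the policy and the frozen reference are already computed, and uses beta = 0.1, a common default):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy DPO loss over preference pairs.

    Each argument is a 1-D tensor of summed per-token log-probs for the
    chosen / rejected completion under the policy or the frozen reference.
    """
    # Implicit rewards: beta * log-ratio between policy and reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Reward margin -- the quantity the title is talking about
    margin = chosen_rewards - rejected_rewards

    # -log sigmoid(margin): BCE with the target "chosen beats rejected"
    loss = -F.logsigmoid(margin).mean()
    return loss, margin.mean()
```

The point of the sketch is just that the "reward" here is an implicit quantity (beta times the policy/reference log-ratio), so a huge margin doesn't mean what a huge reward-model score would mean.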