[2604.06159] Target Policy Optimization
Computer Science > Machine Learning
arXiv:2604.06159 (cs)
[Submitted on 7 Apr 2026]

Title: Target Policy Optimization
Authors: Jean Kaddour

Abstract: In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\,\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p^\theta - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at this https URL.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2604.06159 [cs.LG] (or arXiv:2604.06159v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.06159
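The abstract describes the mechanism concretely enough for a toy illustration: build a target distribution $q_i \propto p_i^{\mathrm{old}} \exp(u_i)$ over a group of sampled completions, then fit the policy to it with a cross-entropy loss whose gradient on the logits is $p^\theta - q$. The sketch below, in PyTorch, follows only that description; the group size, the score vector `u`, and normalizing $q$ over the sampled group are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a TPO-style update on one group of sampled completions,
# based only on the abstract (assumptions: group size, scores u, normalization
# of q over the sampled group).
import torch

torch.manual_seed(0)
G = 4  # number of sampled completions in the group (assumed)

logits = torch.randn(G, requires_grad=True)            # current policy logits for the G completions
old_log_p = torch.log_softmax(logits.detach(), dim=0)  # p^old: probabilities at sampling time (frozen)
u = torch.tensor([1.0, 0.0, 0.0, -1.0])                # scores u_i for each completion (assumed values)

# Target distribution: q_i proportional to p_i^old * exp(u_i), normalized over the group.
q = torch.softmax(old_log_p + u, dim=0)

# Fit the policy to the target by cross-entropy.
log_p = torch.log_softmax(logits, dim=0)
loss = -(q * log_p).sum()
loss.backward()

# The gradient on the logits is p^theta - q, so it vanishes once the policy matches q.
print(torch.allclose(logits.grad, torch.softmax(logits, dim=0) - q, atol=1e-6))  # True
```

The final check mirrors the abstract's claim: because the cross-entropy loss against a fixed target has logit gradient $p^\theta - q$, the update stops moving once the policy reaches the target, independent of how the scores were turned into $q$.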