[2604.01499] Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training


arXiv - Machine Learning 4 min read

About this article


Computer Science > Machine Learning — arXiv:2604.01499 (cs) — Submitted on 2 Apr 2026

Title: Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training

Authors: William Hoy, Binxu Wang, Xu Pan

Abstract: Evolution Strategies (ES) have emerged as a scalable, gradient-free alternative to reinforcement-learning-based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single-task and sequential continual-learning settings. ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off-task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off-task movement on weakly informative directions while still making enough...
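To make the two central ingredients of the abstract concrete, here is a minimal sketch of an antithetic Evolution Strategies update and a linear-interpolation check for a loss barrier between two solutions. This is a toy on a quadratic reward, not the paper's setup: the hyperparameters (`sigma`, `lr`, `n_pairs`), the objective, and the stand-in "second solution" are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20
target = rng.normal(size=dim)

def reward(theta):
    # Toy reward: negative squared distance to a fixed target vector.
    return -np.sum((theta - target) ** 2)

def es_step(theta, sigma=0.1, lr=0.05, n_pairs=32):
    # Antithetic ES: evaluate reward at theta +/- sigma*eps for Gaussian eps,
    # then average reward-weighted noise to estimate the gradient of the
    # Gaussian-smoothed reward. No backpropagation is needed.
    eps = rng.normal(size=(n_pairs, dim))
    r_plus = np.array([reward(theta + sigma * e) for e in eps])
    r_minus = np.array([reward(theta - sigma * e) for e in eps])
    grad_est = ((r_plus - r_minus)[:, None] * eps).sum(axis=0) / (2 * sigma * n_pairs)
    return theta + lr * grad_est

theta = np.zeros(dim)
for _ in range(500):
    theta = es_step(theta)

# Linear-connectivity check: walk the straight line between two solutions and
# look for a loss barrier (a dip in reward below both endpoints). Here theta_b
# is just a stand-in for an independently found solution, e.g. one from GRPO.
theta_b = target.copy()
alphas = np.linspace(0.0, 1.0, 11)
path_rewards = [reward((1 - a) * theta + a * theta_b) for a in alphas]
barrier = min(path_rewards) - min(reward(theta), reward(theta_b))
# barrier >= 0 means no point on the path is worse than the worse endpoint,
# i.e. the two solutions are linearly connected without a loss barrier.
```

On this convex toy the path is trivially barrier-free; the paper's finding is the non-trivial analogue of this check for ES and GRPO solutions of an LLM, where flatness along the connecting segment is not guaranteed.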

Originally published on April 03, 2026. Curated by AI News.


