[2603.02229] Safety Training Persists Through Helpfulness Optimization in LLM Agents

arXiv - Machine Learning

About this article

Computer Science > Machine Learning
arXiv:2603.02229 (cs.LG) · Submitted on 13 Feb 2026

Title: Safety Training Persists Through Helpfulness Optimization in LLM Agents
Authors: Benjamin Plaut

Abstract: Safety post-training has been studied extensively in single-step "chat" settings, where safety typically means refusing harmful requests. We study an "agentic" (i.e., multi-step, tool-use) setting, where safety refers to harmful actions taken directly by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone versus on both metrics sequentially. As expected, training on one metric alone yields an extreme point along the safety-helpfulness frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with $R^2 = 0.77$. Even post-training on both metrics simultaneously simply yields another point on the frontier rather than a "best of both worlds" strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for a better understanding of post-training dynamics.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2603.02229 [cs.LG]
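The paper's headline quantitative claim is that all training configurations land near a linear safety-helpfulness Pareto frontier with $R^2 = 0.77$. As a minimal sketch of what that kind of fit involves, the snippet below runs an ordinary least-squares line fit and computes the coefficient of determination. The (helpfulness, safety) points are synthetic placeholders for illustration only, not the paper's data.

```python
# Minimal sketch: least-squares line fit and R^2, the statistic used to
# characterize a linear Pareto frontier between helpfulness and safety.
# The data points below are SYNTHETIC placeholders, not the paper's results.

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    return a, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical scores for five training configurations.
helpfulness = [0.2, 0.4, 0.5, 0.7, 0.9]
safety = [0.95, 0.80, 0.72, 0.55, 0.35]

a, b = linear_fit(helpfulness, safety)
r2 = r_squared(helpfulness, safety, a, b)
print(f"slope={a:.3f}, intercept={b:.3f}, R^2={r2:.3f}")
```

A negative slope here encodes the trade-off the paper describes: gains in helpfulness come at a cost in safety, with every configuration clustering near one line rather than dominating it.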

Originally published on March 04, 2026. Curated by AI News.

