[2511.12779] Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation
Summary
This paper introduces a two-stage meta-training and fine-tuning procedure for multi-objective reinforcement learning that partitions many objectives into a small number of jointly trainable groups, demonstrating significant performance and training-speed improvements on robotic control tasks.
Why It Matters
As reinforcement learning applications expand, particularly in robotics, control, and language-model preference optimization, the ability to optimize many objectives simultaneously becomes crucial. This research addresses the inefficiency of training a single policy over all objectives, offering a scalable solution that improves performance and reduces training time, which is vital for real-world applications.
Key Takeaways
- Introduces a two-stage procedure for multi-objective reinforcement learning.
- Demonstrates an average performance improvement of 16% over state-of-the-art methods.
- Achieves up to 26 times faster training speed compared to full training methods.
- Validates the effectiveness of loss-based clustering for objective grouping.
- Analyzes generalization error through Hessian trace measurement.
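The loss-based clustering mentioned above can be illustrated with a minimal sketch: each objective is described by a vector of adaptation losses, and objectives with similar loss profiles are grouped together. The function and data below are hypothetical stand-ins, not the paper's actual implementation; a plain k-means with deterministic farthest-point initialization is used for the grouping step.

```python
# Hypothetical sketch of loss-based objective grouping (not the paper's code):
# row i of `losses` holds objective i's loss under m sampled adaptation settings;
# objectives with similar loss profiles are clustered into k groups.
import numpy as np

def cluster_objectives(losses: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    # farthest-point initialization keeps this sketch deterministic
    centers = [losses[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(losses - c, axis=1) for c in centers], axis=0)
        centers.append(losses[d.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(losses), dtype=int)
    for _ in range(iters):
        # assign each objective to its nearest cluster center
        d = np.linalg.norm(losses[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster is empty
        for j in range(k):
            if np.any(labels == j):
                centers[j] = losses[labels == j].mean(axis=0)
    return labels

# toy example: 6 objectives whose loss profiles form two clear groups
losses = np.array([[0.10, 0.20], [0.12, 0.19], [0.11, 0.21],
                   [0.90, 0.80], [0.88, 0.82], [0.91, 0.79]])
labels = cluster_objectives(losses, k=2)
```

In this toy run, the first three objectives land in one group and the last three in the other, so each group can then be trained with a single shared policy.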
Paper Details
Computer Science > Machine Learning — arXiv:2511.12779 (cs)
Submitted on 16 Nov 2025 (v1); last revised 23 Feb 2026 (this version, v3)
Title: Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation
Authors: Zhenshuo Zhang, Minxuan Duan, Youran Ye, Hongyang R. Zhang
Abstract
We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in applications such as robotics, control, and preference optimization in language models, where learning a single policy for all $n$ objectives is suboptimal as $n$ grows. We introduce a two-stage procedure -- meta-training followed by fine-tuning -- to address this problem. We first learn a meta-policy for all objectives using multitask learning. Then, we adapt the meta-policy to multiple randomly sampled subsets of objectives. The adaptation step leverages a first-order approximation property of well-trained policy networks, which is empirically verified to be accurate within a 2% error margin across various RL environments. The resulting algorithm, PolicyGradEx, effic...
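The first-order approximation in the abstract can be sketched on a toy surrogate: the loss after a small fine-tuning step is predicted from the current loss and gradient alone, without rerunning training. The quadratic loss below is an illustrative stand-in for a policy network's loss, not the paper's setup; it shows the Taylor-expansion mechanism and checks that the prediction error stays within the 2% margin the paper reports empirically.

```python
# Sketch of the first-order approximation behind gradient-based estimation:
# predict loss(theta - eta * g) from loss(theta) and the gradient g alone.
# A quadratic surrogate stands in for a policy loss (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
A = G @ G.T + 5.0 * np.eye(5)        # symmetric positive-definite "curvature"
b = rng.standard_normal(5)

def loss(theta):
    # quadratic surrogate: 0.5 * theta^T A theta - b^T theta
    return 0.5 * theta @ A @ theta - b @ theta

theta = rng.standard_normal(5)
g = A @ theta - b                    # exact gradient of the surrogate at theta
eta = 1e-3                           # small fine-tuning step size

actual = loss(theta - eta * g)                  # loss after one gradient step
estimate = loss(theta) - eta * (g @ g)          # first-order Taylor prediction
rel_err = abs(actual - estimate) / abs(actual)  # relative prediction error
```

For small step sizes the neglected term is the second-order curvature contribution, so the relative error here is far below the paper's reported 2% margin; the paper's contribution is verifying that trained policy networks behave similarly under adaptation.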