[2602.17632] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

arXiv - Machine Learning

Summary

The paper presents SMAC (Score Matched Actor-Critic), an offline reinforcement learning method designed so that actor-critics trained offline can be fine-tuned online with value-based RL algorithms without the immediate performance drops that plague existing approaches.

Why It Matters

This research tackles a common failure mode in reinforcement learning: performance collapses when an agent trained offline is fine-tuned online. By minimizing regret during this transition and ensuring a smooth handoff, SMAC broadens the applicability of reinforcement learning to real-world settings where an initial performance drop is unacceptable.

Key Takeaways

  • SMAC regularizes the Q-function to maintain performance during online fine-tuning.
  • The method demonstrates a significant reduction in regret (34-58%) compared to existing baselines.
  • SMAC successfully connects offline maxima to better online maxima, facilitating smoother transitions.
  • It achieves consistent performance across multiple D4RL tasks.
  • The findings could lead to more robust applications of reinforcement learning in practical environments.

Computer Science > Machine Learning
arXiv:2602.17632 (cs) [Submitted on 19 Feb 2026]

Title: SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Authors: Nathan S. de Lara, Florian Shkurti

Abstract: Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6...
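The first-order derivative equality in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: `q_fn`, `log_pi`, the Gaussian policy, and the temperature `alpha` are all illustrative assumptions. For a Gaussian policy whose mean tracks the critic's maximizer, the policy score -(a - mu)/sigma^2 and the action-gradient of Q coincide up to a scale factor, so a squared-difference penalty of the kind SMAC's regularizer suggests evaluates to (near) zero:

```python
import numpy as np

def q_fn(s, a):
    # Toy critic: quadratic in the action, peaked at mu(s) = tanh(s).
    mu = np.tanh(s)
    return -0.5 * np.sum((a - mu) ** 2)

def log_pi(s, a, sigma=1.0):
    # Toy Gaussian policy log-density (up to an a-independent constant).
    mu = np.tanh(s)
    return -0.5 * np.sum((a - mu) ** 2) / sigma ** 2

def grad_a(f, s, a, eps=1e-5):
    # Central finite-difference gradient of f(s, a) with respect to a.
    g = np.zeros_like(a)
    for i in range(a.size):
        d = np.zeros_like(a)
        d[i] = eps
        g[i] = (f(s, a + d) - f(s, a - d)) / (2 * eps)
    return g

def score_match_penalty(s, a, alpha=1.0):
    # Hypothetical penalty in the spirit of SMAC's regularizer:
    # squared gap between the policy score and (1/alpha) * grad_a Q.
    score = grad_a(log_pi, s, a)
    dq_da = grad_a(q_fn, s, a)
    return float(np.mean((score - dq_da / alpha) ** 2))
```

With `sigma=1` and `alpha=1` the score and the action-gradient of Q agree exactly, so the penalty vanishes; with a mismatched temperature (`alpha=2`) the penalty is strictly positive, which is the signal an offline training loop would minimize. In a real actor-critic both gradients would come from automatic differentiation rather than finite differences.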

Related Articles

  • [2604.01676] GPA: Learning GUI Process Automation from Demonstrations (arXiv - AI, LLMs)
  • [2604.01413] Adaptive Stopping for Multi-Turn LLM Reasoning (arXiv - AI, LLMs)
  • [2603.13842] Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving (arXiv - AI, Machine Learning)
  • [2603.12510] Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies (arXiv - AI, Machine Learning)