[2602.17632] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Summary
The paper presents SMAC (Score-Matched Actor-Critic), an offline reinforcement learning method designed so that offline-trained actor-critics can be fine-tuned online without the performance drop that typically accompanies the handoff, addressing a critical challenge for actor-critic algorithms.
Why It Matters
This research is significant because it tackles a common failure mode in reinforcement learning: performance deteriorates during the transition from offline training to online fine-tuning. By reducing regret and smoothing that transition, it can broaden the applicability of reinforcement learning in real-world scenarios where an initial performance dip is unacceptable.
Key Takeaways
- SMAC regularizes the Q-function to maintain performance during online fine-tuning.
- The method demonstrates a significant reduction in regret (34-58%) compared to existing baselines.
- SMAC successfully connects offline maxima to better online maxima, facilitating smoother transitions.
- It achieves consistent performance across multiple D4RL tasks.
- The findings could lead to more robust applications of reinforcement learning in practical environments.
Computer Science > Machine Learning, arXiv:2602.17632 (cs) [Submitted on 19 Feb 2026]
Authors: Nathan S. de Lara, Florian Shkurti
Abstract: Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6...
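The derivative equality the abstract describes can be sketched numerically. The toy below uses a quadratic Q-function and a Gaussian policy, and penalizes the mismatch between the policy's score, the gradient of log pi(a|s) with respect to the action a, and the action-gradient of Q scaled by a temperature alpha. All function forms, names, and the 1/alpha scaling here are illustrative assumptions for intuition, not the paper's actual loss or implementation:

```python
import numpy as np

def score_matching_penalty(states, actions, alpha, q_peak, policy_mean, policy_var):
    """Mean squared mismatch between (1/alpha) * dQ/da and the policy score.

    Toy forms (assumptions, not SMAC's models):
      Q(s, a) = -(a - q_peak(s))**2          -> dQ/da  = -2 * (a - q_peak(s))
      pi(a|s) = Normal(policy_mean(s), var)  -> d log pi / da = -(a - mean) / var
    """
    grad_q = -2.0 * (actions - q_peak(states))
    score = -(actions - policy_mean(states)) / policy_var
    return np.mean((grad_q / alpha - score) ** 2)

rng = np.random.default_rng(0)
s = rng.normal(size=(128, 1))
a = rng.normal(size=(128, 1))
alpha = 0.2

# If the policy's mean tracks the Q-maximizer and its variance equals alpha/2,
# the two gradient fields coincide and the penalty vanishes; any mismatch in
# mean or variance makes it strictly positive.
penalty = score_matching_penalty(s, a, alpha,
                                 q_peak=lambda x: x,
                                 policy_mean=lambda x: x,
                                 policy_var=alpha / 2.0)
```

In this toy setting the equality holds exactly at the matched variance, which mirrors the paper's claim that enforcing it offline keeps the critic's action-gradients consistent with the policy that online value-based algorithms will follow.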