[2603.23232] GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL
Computer Science > Machine Learning
arXiv:2603.23232 (cs)
[Submitted on 24 Mar 2026]

Title: GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL
Authors: Haoyu Wang, Jingcheng Wang, Shunyu Wu, Xinwei Xiao

Abstract: Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action-selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield "in-between" actions that are weakly supported by the data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together wi...
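The candidate-based selection step the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the GMM actor, the behavior model (here reused as the same toy mixture), the five-critic ensemble, and the support weight `beta` are all hypothetical stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D GMM actor: two components with distinct modes.
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([0.2, 0.2])

def sample_candidates(n):
    """Draw a parallel set of n candidate actions from the GMM actor."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comps], stds[comps])

def behavior_log_density(a):
    """Log-density under a GMM behavior model, used to quantify support.

    For brevity this toy reuses the actor's mixture as the behavior model;
    in the paper these are separately learned models.
    """
    comp_pdf = weights * np.exp(-0.5 * ((a[:, None] - means) / stds) ** 2) \
        / (stds * np.sqrt(2 * np.pi))
    return np.log(comp_pdf.sum(axis=1) + 1e-12)

def ensemble_lcb(a, k=1.0):
    """Conservative lower-confidence bound over a toy critic ensemble."""
    # Five toy critics that all prefer actions near +1, with small disagreement.
    qs = np.stack([-(a - 1.0) ** 2 + 0.05 * rng.standard_normal(a.shape)
                   for _ in range(5)])
    return qs.mean(axis=0) - k * qs.std(axis=0)

def select_action(n=64, beta=0.1):
    """Rerank candidates by LCB value plus a behavior-support bonus."""
    cands = sample_candidates(n)
    scores = ensemble_lcb(cands) + beta * behavior_log_density(cands)
    return float(cands[np.argmax(scores)])

print(select_action())
```

Because the toy critics reward actions near the +1 mode and the support term penalizes off-mode candidates, the selected action lands near a data-supported mode rather than at an unsupported "in-between" point such as 0.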