[2510.24983] LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies
Summary
LRT-Diffusion introduces a risk-aware sampling method for diffusion policies in offline reinforcement learning, enhancing decision-making through calibrated risk control.
Why It Matters
This research addresses a limitation of existing diffusion policies by incorporating a statistical approach to risk management, potentially improving performance in offline reinforcement learning tasks. It offers a novel framework that exposes a tunable continuum from exploitation to conservatism, which is crucial for developing safer and more effective AI systems.
Key Takeaways
- LRT-Diffusion provides a risk-aware sampling rule for diffusion policies.
- The method allows for evidence-driven adjustments based on user-defined risk budgets.
- It improves the trade-off between return and out-of-distribution (OOD) actions in offline reinforcement learning tasks.
- The framework integrates seamlessly with existing Q-guided baselines.
- Theoretical analysis establishes stability bounds and performance comparisons.
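The "user-defined risk budget" in the takeaways above works by calibrating the gate threshold tau once under the null hypothesis H0 (the unconditional prior): simulate log-likelihood-ratio statistics with no real conditional evidence present, then set tau to their (1 - alpha) quantile so the gate opens spuriously at roughly the chosen Type-I rate alpha. A minimal sketch of that calibration step; the names `calibrate_tau` and `llr_samples_h0` are illustrative, not the paper's API:

```python
import numpy as np

def calibrate_tau(llr_samples_h0, alpha):
    """Choose the gate threshold tau as the (1 - alpha) quantile of
    log-likelihood-ratio statistics simulated under H0, so that the
    false-trigger rate of the gate is approximately alpha.

    llr_samples_h0 : array of LLR statistics drawn under H0
    alpha          : user-specified Type-I error level (risk budget)
    """
    return float(np.quantile(llr_samples_h0, 1.0 - alpha))

# Example: if H0 LLR statistics were standard normal, alpha = 0.05
# would place tau near the 95th percentile of that distribution.
rng = np.random.default_rng(0)
tau = calibrate_tau(rng.normal(size=10_000), alpha=0.05)
```

Because tau is calibrated once offline, the sampling-time controller inherits an interpretable guarantee: under H0, guidance fires with probability at most about alpha.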
arXiv:2510.24983 (cs) · Submitted on 28 Oct 2025 (v1), last revised 19 Feb 2026 (v2)
Title: LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies
Authors: Ximan Sun, Xiang Cheng
Abstract: Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report ...
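The abstract's gating rule can be sketched as follows: at each denoising step, accumulate the log-likelihood ratio between the conditional and unconditional transition densities, pass the running total through a logistic controller centered at the calibrated threshold tau, and use the result to blend the unconditional mean toward the conditional one. This is a minimal illustration assuming isotropic Gaussian heads with a shared noise scale; all function and argument names (including `slope`) are assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lrt_gated_mean(mu_uncond, mu_cond, x_prev, sigma, llr, tau, slope=1.0):
    """One LRT-gated denoising step (illustrative sketch).

    mu_uncond, mu_cond : predicted means of the two policy heads
    x_prev             : sample from the previous denoising step
    sigma              : shared Gaussian noise scale at this step
    llr                : log-likelihood ratio accumulated so far
    tau                : threshold calibrated under H0 for Type-I level alpha
    slope              : sharpness of the logistic controller (assumed knob)
    """
    # Per-step LLR of the observed sample under the two isotropic Gaussians:
    # log N(x_prev; mu_cond, sigma^2 I) - log N(x_prev; mu_uncond, sigma^2 I)
    llr_inc = (np.sum((x_prev - mu_uncond) ** 2)
               - np.sum((x_prev - mu_cond) ** 2)) / (2.0 * sigma ** 2)
    llr = llr + llr_inc

    # Logistic gate: opens as accumulated evidence exceeds the calibrated tau.
    g = sigmoid(slope * (llr - tau))

    # Evidence-driven adjustment: interpolate from the unconditional mean
    # (g ~ 0, conservative) toward the conditional mean (g ~ 1, exploitative).
    mu = mu_uncond + g * (mu_cond - mu_uncond)
    return mu, llr
```

With strong accumulated evidence (llr well above tau) the sampler follows the conditional head; with weak evidence it falls back to the unconditional prior, which is what turns guidance from a fixed push into an evidence-driven adjustment.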