[2602.08655] From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism
Summary
This article covers Geometric Pessimism, a framework for Offline Reinforcement Learning (RL) that improves policy recovery from static datasets, with demonstrated gains in both robotics benchmarks and sepsis treatment.
Why It Matters
The research addresses critical challenges in Offline RL, particularly the overestimation of out-of-distribution actions, which can lead to poor decision-making in real-world applications like healthcare. By providing a more efficient and effective method, this work has implications for improving safety and performance in automated systems.
Key Takeaways
- Geometric Pessimism enhances Offline RL by mitigating OOD action overestimation.
- The Geo-IQL method outperforms standard IQL by over 18 points on unstable medium-replay tasks.
- The approach maintains safety constraints while improving decision-making in critical care settings.
Computer Science > Machine Learning
arXiv:2602.08655 (cs)
[Submitted on 9 Feb 2026 (v1), last revised 16 Feb 2026 (this version, v2)]
Title: From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism
Authors: Sarthak Wanjari
Abstract: Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly on fractured and sparse data manifolds. Current solutions necessitate a trade-off between computational efficiency and performance: methods like CQL offer rigorous conservatism but require tremendous compute, while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to Behavioural Cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with a density-based penalty derived from k-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalty applied to each state-action pair, our method injects OOD conservatism via reward shaping while adding only O(1) overhead to the training loop. Evaluated on the D4RL MuJoCo benchmark, our method, Geo-IQL, outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points, while...
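The core mechanism described in the abstract — a kNN-distance penalty precomputed over state-action embeddings and subtracted from the dataset rewards — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding is assumed to be raw state-action concatenation, the brute-force neighbour search stands in for whatever index the authors use, and the penalty scale `beta` is a hypothetical hyperparameter.

```python
import numpy as np

def knn_density_penalty(sa_embeddings, k=5):
    """Mean Euclidean distance from each state-action embedding to its
    k nearest neighbours in the dataset. Larger distance => sparser
    region of the data manifold => larger pessimism penalty.
    (Brute-force O(N^2) sketch; a KD-tree or FAISS index would be used
    at scale.)"""
    diffs = sa_embeddings[:, None, :] - sa_embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]      # k smallest distances per point
    return knn.mean(axis=1)

def shape_rewards(rewards, penalties, beta=1.0):
    """Pessimistic reward shaping: r' = r - beta * penalty.
    Penalties are computed once over the static dataset before training,
    so the per-step cost inside the training loop is O(1)."""
    return rewards - beta * penalties

# Toy dataset: a dense cluster plus one far-away (OOD-like) point.
rng = np.random.default_rng(0)
sa = np.vstack([rng.normal(0.0, 0.1, size=(20, 4)),
                np.full((1, 4), 5.0)])       # outlier off the data manifold
pen = knn_density_penalty(sa, k=5)
shaped = shape_rewards(np.ones(len(sa)), pen, beta=0.5)
# The outlier receives the largest penalty, so its shaped reward is lowest.
```

After shaping, the dataset (states, actions, shaped rewards) is handed to an unmodified IQL learner, which is what makes the approach modular: the conservatism lives entirely in the data, not in the algorithm.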