[2602.21424] On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Summary
The paper formalises how reinforcement learning agents' actions depend on internally accumulated information (such as memory or inferred latent context), and establishes structural conditions under which this behavioural dependency fails to be preserved by policy transformations such as convex aggregation.
Why It Matters
Understanding when epistemic behaviour is not preserved matters for building robust reinforcement learning systems. The results show that common policy transformations, such as averaging or mixing policies, can silently reduce or erase an agent's dependence on its internal information, which in turn affects how aggregated or distilled policies make decisions.
Key Takeaways
- Behavioural dependency — variation in action selection with respect to internal information under fixed observations — can be quantified by a within-policy behavioural distance.
- The set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation.
- Behavioural distance contracts under convex combination, and gradient ascent on a skewed mixture objective can decrease it further under a local alignment condition.
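The contraction takeaway can be illustrated numerically. This is a minimal sketch, not the paper's construction: it assumes total-variation distance as the behavioural-distance measure, and the policies `pi1`, `pi2` and internal-information states `"m1"`, `"m2"` are illustrative.

```python
import numpy as np

def behavioural_distance(p_m1, p_m2):
    """Total-variation distance between two action distributions
    (an assumed stand-in for the paper's behavioural distance)."""
    return 0.5 * np.abs(p_m1 - p_m2).sum()

# Two toy policies over 3 actions, each conditioned on an internal
# probe state m1 or m2, with the observation held fixed.
pi1 = {"m1": np.array([0.8, 0.1, 0.1]), "m2": np.array([0.1, 0.8, 0.1])}
pi2 = {"m1": np.array([0.2, 0.3, 0.5]), "m2": np.array([0.3, 0.2, 0.5])}

def mix(pi_a, pi_b, lam):
    """Convex aggregation of two policies, probe state by probe state."""
    return {m: lam * pi_a[m] + (1 - lam) * pi_b[m] for m in pi_a}

lam = 0.5
d1 = behavioural_distance(pi1["m1"], pi1["m2"])   # 0.7
d2 = behavioural_distance(pi2["m1"], pi2["m2"])   # 0.1
pi_mix = mix(pi1, pi2, lam)
d_mix = behavioural_distance(pi_mix["m1"], pi_mix["m2"])  # 0.3

# Convexity of total variation gives the contraction:
# d_mix <= lam * d1 + (1 - lam) * d2  (here 0.3 <= 0.4)
print(d1, d2, d_mix)
```

Here the mixture's behavioural distance (0.3) is strictly below the convex combination of the components' distances (0.4): aggregating the two policies shrinks their sensitivity to the internal probe.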
Computer Science > Machine Learning
arXiv:2602.21424 (cs) [Submitted on 24 Feb 2026]
Title: On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Authors: Alexander Galozy
Abstract: Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of $\epsilon$-behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex aggregation…
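The abstract's definitions can be written out as follows. This is a plausible formalisation only, with assumed notation ($\pi$, $o$, $m$, $d$); the paper's exact definitions are not reproduced in this summary.

```latex
% Probe-relative \epsilon-behavioural equivalence of internal states
% m_1, m_2, for a fixed observation o and a distance d on action
% distributions (e.g. total variation):
\[
  m_1 \sim_{\epsilon} m_2
  \iff
  d\bigl(\pi(\cdot \mid o, m_1),\, \pi(\cdot \mid o, m_2)\bigr) \le \epsilon .
\]
% Within-policy behavioural distance (probe sensitivity):
\[
  D(\pi) = \sup_{o,\, m_1, m_2}
  d\bigl(\pi(\cdot \mid o, m_1),\, \pi(\cdot \mid o, m_2)\bigr).
\]
% Contraction under convex combination, by convexity of d:
\[
  D\bigl(\lambda \pi_1 + (1-\lambda)\pi_2\bigr)
  \le \lambda D(\pi_1) + (1-\lambda) D(\pi_2).
\]
```

Under this reading, the first structural result says the set $\{\pi : D(\pi) > 0\}$ is not convex, while the second is the displayed contraction inequality.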