[2602.21424] On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

arXiv - AI · 3 min read

Summary

The paper studies how reinforcement learning agents' actions depend on internally accumulated information, and identifies structural conditions under which this behavioural dependency fails to be preserved by policy transformations such as convex aggregation.

Why It Matters

Understanding the structural non-preservation of epistemic behaviour matters for building robust reinforcement learning systems. The research shows that common policy transformations, in particular convex aggregation of policies, can weaken or erase a policy's dependence on internal information, with consequences for AI decision-making and for interpreting what a trained agent actually conditions on.

Key Takeaways

  • Behavioural dependency formalises how an RL agent's action selection varies with internal information (e.g. memory or inferred latent context) under fixed observations.
  • The set of policies with non-trivial behavioural dependency is not closed under convex aggregation: mixing such policies can produce a policy with no dependency at all.
  • Behavioural distance contracts under convex combination, so mixing policies can only reduce, never amplify, probe sensitivity during optimisation.

Computer Science > Machine Learning · arXiv:2602.21424 (cs)
[Submitted on 24 Feb 2026]

Title: On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Authors: Alexander Galozy

Abstract: Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of $\epsilon$-behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex agg...
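The non-closure and contraction results can be illustrated with a minimal sketch (not the paper's code): two toy policies over two actions, each fully determined by a binary internal state under one fixed observation, whose equal-weight mixture is insensitive to that state. Behavioural distance is measured here with total-variation distance between the action distributions induced by the two internal states; the paper's exact probe and distance definitions may differ.

```python
import numpy as np

def behavioural_distance(policy):
    """Total-variation distance between the action distributions
    conditioned on the two internal states (observation held fixed)."""
    return 0.5 * np.abs(policy[0] - policy[1]).sum()

# policy[z] = action distribution given internal state z
pi1 = np.array([[1.0, 0.0],   # z=0 -> always action 0
                [0.0, 1.0]])  # z=1 -> always action 1
pi2 = np.array([[0.0, 1.0],   # z=0 -> always action 1
                [1.0, 0.0]])  # z=1 -> always action 0

lam = 0.5
mix = lam * pi1 + (1 - lam) * pi2  # convex aggregation of the two policies

print(behavioural_distance(pi1))  # 1.0: maximal dependency on z
print(behavioural_distance(pi2))  # 1.0: maximal dependency on z
print(behavioural_distance(mix))  # 0.0: dependency vanishes in the mixture
```

The mixture's distance (0.0) falls strictly below the convex combination of the components' distances (1.0), witnessing both non-closure (two dependency-exhibiting policies mix to one with none) and contraction of behavioural distance under convex combination.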
