[2602.12963] Information-theoretic analysis of world models in optimal reward maximizers
Summary
This paper presents an information-theoretic analysis of world models in optimal reward maximizers, quantifying exactly how much information an optimal policy conveys about the transition dynamics of a controlled Markov process.
Why It Matters
How much an optimal policy reveals about its environment bears directly on a central question in AI: whether successful behaviour requires an internal representation of the world. By giving a precise, quantitative handle on the implicit world model any optimal agent must carry, this research informs both the design and the interpretation of capable AI agents.
Key Takeaways
- Under a uniform prior over transition dynamics, observing a deterministic policy that is optimal for any non-constant reward function conveys exactly n log m bits of information about a controlled Markov process with n states and m actions.
- This establishes a precise information-theoretic lower bound on the implicit world model an agent must encode to achieve optimality, across a range of reward structures.
- The research applies to a broad range of objectives, including finite-horizon and infinite-horizon scenarios.
Computer Science > Artificial Intelligence
arXiv:2602.12963 (cs) [Submitted on 13 Feb 2026]
Title: Information-theoretic analysis of world models in optimal reward maximizers
Authors: Alfred Harwood, Jose Faustino, Alex Altair
Abstract: An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function then conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the "implicit world model" necessary for optimality.
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.12963 [cs.AI]
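The headline result can be sanity-checked numerically. The sketch below is my own illustration, not code from the paper: it samples transition kernels from a uniform (Dirichlet) prior for a tiny CMP with n = 2 states and m = 2 actions, solves each sampled environment by value iteration under the discounted objective with a non-constant state reward, and estimates the entropy of the resulting optimal policy. Since the policy is a deterministic function of the environment, I(environment; policy) = H(policy), which the paper's result predicts should equal n log2 m = 2 bits. All parameter choices (discount factor, reward vector, sample count) are illustrative assumptions.

```python
import numpy as np

n, m = 2, 2                 # states, actions (illustrative choice)
gamma = 0.9                 # discount factor (infinite-horizon discounted case)
r = np.array([1.0, 0.0])    # a non-constant, state-based reward (assumed)
N = 20_000                  # number of sampled environments

rng = np.random.default_rng(0)
# Uniform (Dirichlet(1,...,1)) prior over each transition row P[s, a, :]
P = rng.dirichlet(np.ones(n), size=(N, n, m))   # shape (N, n, m, n)

# Value iteration, vectorized across all sampled environments
V = np.zeros((N, n))
for _ in range(200):
    Q = r[None, :, None] + gamma * np.einsum('ksan,kn->ksa', P, V)
    V = Q.max(axis=-1)

# Deterministic optimal policy per environment: one action per state
policy = Q.argmax(axis=-1)                      # shape (N, n)
codes = policy @ (m ** np.arange(n))            # encode each policy as an int
p = np.bincount(codes, minlength=m ** n) / N

# I(env; policy) = H(policy), since the policy is a deterministic
# function of the sampled environment
H = -np.sum(p[p > 0] * np.log2(p[p > 0]))
print(f"H(policy) = {H:.3f} bits; n*log2(m) = {n * np.log2(m):.3f} bits")
```

With a symmetric prior, relabelling the actions in any one state leaves the prior invariant while permuting that state's optimal action, so the optimal policy is uniform over all m^n deterministic policies and the empirical entropy lands close to 2 bits, matching the paper's n log m claim for this small case.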