[2511.22581] High entropy leads to symmetry equivariant policies in Dec-POMDPs

arXiv - Machine Learning · 4 min read

Summary

This paper proves that sufficiently high entropy regularization in Dec-POMDPs causes policy gradient ascent to converge, from any initialization, to the same symmetry-equivariant joint policy, so independently trained runs end up with a consistent joint policy.

Why It Matters

The findings highlight the role of entropy in policy training for Dec-POMDPs: higher entropy coefficients can make independently trained policies compatible with one another in cross-play. This is particularly relevant for multi-agent systems, where agents trained separately must still coordinate effectively.

Key Takeaways

  • Sufficiently high entropy regularization guarantees convergence to the same symmetry-equivariant joint policy in any Dec-POMDP.
  • Policies trained with different random seeds become fully compatible: their cross-play returns equal their self-play returns.
  • Empirically, the entropy coefficient strongly affects cross-play returns, so higher values deserve attention during hyperparameter tuning (a minimal sketch of the regularized update follows this list).
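
The regularized update behind these claims is simple to write down. What follows is a minimal sketch, assuming a toy two-agent coordination game as a stand-in for a Dec-POMDP: plain policy gradient ascent on expected return plus a tau-weighted entropy bonus, with one tabular softmax per agent. The game, the learning rate, and the value of tau are illustrative choices, not the paper's experimental setup.

```python
# Minimal sketch of entropy-regularized policy gradient ascent with a
# tabular softmax parametrization, on a toy two-agent coordination game.
# The game, coefficients, and names are illustrative assumptions,
# not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

# Payoff: both agents get 1 if they pick the same action, else 0.
# Swapping the two actions for both agents is a symmetry of the game.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

tau = 1.0   # entropy coefficient; "sufficiently high" for this game
lr = 0.2
theta = [rng.normal(size=2), rng.normal(size=2)]  # one logit table per agent

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = [softmax(t) for t in theta]
    for i in (0, 1):
        # Per-action expected return against the other agent's current policy.
        q = R @ pi[1] if i == 0 else R.T @ pi[0]
        # Gradient of  E[return] + tau * H(pi_i)  w.r.t. agent i's logits,
        # using d softmax / d logits = diag(pi) - pi pi^T.
        adv = q - tau * (np.log(pi[i]) + 1.0)
        theta[i] += lr * pi[i] * (adv - pi[i] @ adv)

print([softmax(t).round(3) for t in theta])
# With tau = 1.0 both agents converge to the uniform policy from any seed:
# it is the unique optimum of the regularized objective, and it is
# equivariant under the action-swap symmetry. With tau near 0, different
# seeds instead break the symmetry toward one of the two deterministic
# conventions.
```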

Computer Science > Machine Learning
arXiv:2511.22581 (cs)
[Submitted on 27 Nov 2025 (v1), last revised 17 Feb 2026 (this version, v2)]

Title: High entropy leads to symmetry equivariant policies in Dec-POMDPs
Authors: Johannes Forkel, Constantin Ruhdorfer, Andreas Bulling, Jakob Foerster

Abstract: We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that policy gradient ascent with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different random seeds will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive empirical evaluation of independent PPO in the Hanabi, Overcooked, and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the drop in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi we achieve a new SOTA in inter-seed cross-play this way. Despite clear limitations of this recipe, which we point out, both our theoretical and empirical results indicate that ...
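
The abstract's two evaluation ideas, cross-play compatibility and post-training greedification, can be illustrated on the same toy game. The three "runs" below are hand-constructed policy vectors chosen to mimic low- and high-entropy training outcomes; they are assumptions for illustration, not policies from the paper.

```python
# Hedged sketch of cross-play vs. self-play returns and of post-training
# greedification, on the same toy coordination game as above. The "seed"
# policies are invented to mimic training outcomes, not taken from the paper.
import numpy as np

R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def joint_return(pi0, pi1):
    # Expected return when agent 0 plays pi0 and agent 1 plays pi1.
    return pi0 @ R @ pi1

def greedify(pi):
    # Replace a stochastic policy with the deterministic argmax policy.
    g = np.zeros_like(pi)
    g[np.argmax(pi)] = 1.0
    return g

# Two low-entropy runs that broke the action symmetry in opposite ways,
# and one high-entropy run that stayed near the symmetric (uniform) policy.
seed_a = (np.array([0.95, 0.05]), np.array([0.95, 0.05]))
seed_b = (np.array([0.05, 0.95]), np.array([0.05, 0.95]))
seed_c = (np.array([0.52, 0.48]), np.array([0.52, 0.48]))

print(joint_return(*seed_a))               # self-play: 0.905
print(joint_return(seed_a[0], seed_b[1]))  # cross-play: 0.095 -- incompatible
print(joint_return(*seed_c))               # self-play: ~0.5, and roughly equal
                                           # to cross-play against any other
                                           # near-uniform run -- compatible
# Greedifying within a run recovers a deterministic convention and its
# full return:
print(joint_return(greedify(seed_c[0]), greedify(seed_c[1])))  # 1.0
```

The high-entropy run buys cross-play compatibility at the cost of self-play return, which greedification within a run can repay; this mirrors the trade-off the abstract describes, including its noted limitations.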

Related Articles

Ai Startups

This AI startup envisions 100 Million New People Making Videogames

Reddit - Artificial Intelligence · 1 min
Llms

A robot car with a Claude AI brain started a YouTube vlog about its own existence

Not a demo reel. Not a tutorial. A robot narrating its own experience — debugging, falling off shelves, questioning its identity. First-p...

Reddit - Artificial Intelligence · 1 min
Ai Startups

Anthropic ramps up its political activities with a new PAC | TechCrunch

With the midterms right around the corner, the new group is positioned to back candidates who support the AI company's policy agenda.

TechCrunch - AI · 3 min
Ai Startups

Anthropic buys biotech startup Coefficient Bio in $400M deal: Reports | TechCrunch

Anthropic has purchased the stealth biotech AI startup Coefficient Bio in a $400 million stock deal, according to The Information and Eri...

TechCrunch - AI · 3 min
