[2511.22581] High entropy leads to symmetry equivariant policies in Dec-POMDPs

arXiv - Machine Learning · 4 min read

Summary

This paper proves that sufficiently high entropy regularization in Dec-POMDPs causes policy gradient ascent to converge, from any initialization, to the same symmetry-equivariant joint policy, so independently trained runs end up with a consistent joint policy.

Why It Matters

The findings highlight the role of entropy in policy training for Dec-POMDPs: higher entropy coefficients can make independently trained policies compatible with one another in cross-play. This is particularly relevant for multi-agent systems, where agents trained separately must still coordinate effectively.

Key Takeaways

  • Sufficiently high entropy regularization guarantees convergence to the same symmetry-equivariant joint policy in any Dec-POMDP.
  • Policies trained with different random seeds become fully compatible: their cross-play returns equal their self-play returns.
  • Empirically, the entropy coefficient strongly affects cross-play returns, so higher values deserve attention during hyperparameter tuning (a minimal sketch of the regularized update follows this list).
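
The regularized update behind these claims is simple to write down. What follows is a minimal sketch, assuming a toy two-agent coordination game as a stand-in for a Dec-POMDP: plain policy gradient ascent on expected return plus a tau-weighted entropy bonus, with one tabular softmax per agent. The game, the learning rate, and the value of tau are illustrative choices, not the paper's experimental setup.

```python
# Minimal sketch of entropy-regularized policy gradient ascent with a
# tabular softmax parametrization, on a toy two-agent coordination game.
# The game, coefficients, and names are illustrative assumptions,
# not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

# Payoff: both agents get 1 if they pick the same action, else 0.
# Swapping the two actions for both agents is a symmetry of the game.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

tau = 1.0   # entropy coefficient; "sufficiently high" for this game
lr = 0.2
theta = [rng.normal(size=2), rng.normal(size=2)]  # one logit table per agent

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = [softmax(t) for t in theta]
    for i in (0, 1):
        # Per-action expected return against the other agent's current policy.
        q = R @ pi[1] if i == 0 else R.T @ pi[0]
        # Gradient of  E[return] + tau * H(pi_i)  w.r.t. agent i's logits,
        # using d softmax / d logits = diag(pi) - pi pi^T.
        adv = q - tau * (np.log(pi[i]) + 1.0)
        theta[i] += lr * pi[i] * (adv - pi[i] @ adv)

print([softmax(t).round(3) for t in theta])
# With tau = 1.0 both agents converge to the uniform policy from any seed:
# it is the unique optimum of the regularized objective, and it is
# equivariant under the action-swap symmetry. With tau near 0, different
# seeds instead break the symmetry toward one of the two deterministic
# conventions.
```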

Computer Science > Machine Learning
arXiv:2511.22581 (cs)
[Submitted on 27 Nov 2025 (v1), last revised 17 Feb 2026 (this version, v2)]

Title: High entropy leads to symmetry equivariant policies in Dec-POMDPs
Authors: Johannes Forkel, Constantin Ruhdorfer, Andreas Bulling, Jakob Foerster

Abstract: We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that policy gradient ascent with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different random seeds will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive empirical evaluation of independent PPO in the Hanabi, Overcooked, and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the drop in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi we achieve a new SOTA in inter-seed cross-play this way. Despite clear limitations of this recipe, which we point out, both our theoretical and empirical results indicate that ...
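
The abstract's two evaluation ideas, cross-play compatibility and post-training greedification, can be illustrated on the same toy game. The three "runs" below are hand-constructed policy vectors chosen to mimic low- and high-entropy training outcomes; they are assumptions for illustration, not policies from the paper.

```python
# Hedged sketch of cross-play vs. self-play returns and of post-training
# greedification, on the same toy coordination game as above. The "seed"
# policies are invented to mimic training outcomes, not taken from the paper.
import numpy as np

R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def joint_return(pi0, pi1):
    # Expected return when agent 0 plays pi0 and agent 1 plays pi1.
    return pi0 @ R @ pi1

def greedify(pi):
    # Replace a stochastic policy with the deterministic argmax policy.
    g = np.zeros_like(pi)
    g[np.argmax(pi)] = 1.0
    return g

# Two low-entropy runs that broke the action symmetry in opposite ways,
# and one high-entropy run that stayed near the symmetric (uniform) policy.
seed_a = (np.array([0.95, 0.05]), np.array([0.95, 0.05]))
seed_b = (np.array([0.05, 0.95]), np.array([0.05, 0.95]))
seed_c = (np.array([0.52, 0.48]), np.array([0.52, 0.48]))

print(joint_return(*seed_a))               # self-play: 0.905
print(joint_return(seed_a[0], seed_b[1]))  # cross-play: 0.095 -- incompatible
print(joint_return(*seed_c))               # self-play: ~0.5, and roughly equal
                                           # to cross-play against any other
                                           # near-uniform run -- compatible
# Greedifying within a run recovers a deterministic convention and its
# full return:
print(joint_return(greedify(seed_c[0]), greedify(seed_c[1])))  # 1.0
```

The high-entropy run buys cross-play compatibility at the cost of self-play return, which greedification within a run can repay; this mirrors the trade-off the abstract describes, including its noted limitations.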

Related Articles

Ai Startups

This AI startup envisions 100 Million New People Making Videogames

Reddit - Artificial Intelligence · 1 min
Llms

A robot car with a Claude AI brain started a YouTube vlog about its own existence

Not a demo reel. Not a tutorial. A robot narrating its own experience — debugging, falling off shelves, questioning its identity. First-p...

Reddit - Artificial Intelligence · 1 min
Ai Startups

Anthropic ramps up its political activities with a new PAC | TechCrunch

With the midterms right around the corner, the new group is positioned to back candidates who support the AI company's policy agenda.

TechCrunch - AI · 3 min
Ai Startups

Anthropic buys biotech startup Coefficient Bio in $400M deal: Reports | TechCrunch

Anthropic has purchased the stealth biotech AI startup Coefficient Bio in a $400 million stock deal, according to The Information and Eri...

TechCrunch - AI · 3 min
