[2506.14518] Two-Player Zero-Sum Games with Bandit Feedback
Summary
This paper studies two-player zero-sum games under bandit feedback, proposing algorithms that adapt the Explore-Then-Commit and action-pair elimination frameworks to select action pairs and minimize regret.
Why It Matters
Understanding zero-sum games is crucial in fields like economics and AI, where strategic decision-making is essential. The proposed algorithms enhance learning efficiency in competitive scenarios, contributing to advancements in machine learning and game theory.
Key Takeaways
- The study introduces three algorithms for optimizing action selection in zero-sum games.
- Instance-dependent regret upper bounds are derived, an aspect that has received limited attention in the zero-sum game literature.
- The algorithms adapt existing frameworks to improve efficiency in learning Nash Equilibria.
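The Explore-Then-Commit idea summarized above can be sketched for a zero-sum matrix game. The following is a minimal illustration, not the paper's exact algorithm: every action pair is sampled uniformly under noisy (bandit) feedback, the payoff matrix is estimated, and the players commit to the max-min pair of the estimate, which coincides with the pure-strategy Nash Equilibrium when a saddle point exists. The noise level, sample counts, and helper names are assumptions for the sketch.

```python
import numpy as np

def etc_zero_sum(true_payoff, rounds, explore_per_pair, rng=None):
    """Illustrative Explore-Then-Commit sketch for a zero-sum matrix game.

    Explores every action pair `explore_per_pair` times under noisy bandit
    feedback, then commits to the max-min pair of the estimated matrix
    (the pure NE when the estimate has a saddle point).
    """
    rng = rng or np.random.default_rng(0)
    n, m = true_payoff.shape
    sums = np.zeros((n, m))
    counts = np.zeros((n, m))
    t = 0
    # Exploration phase: uniform sampling of all action pairs.
    while t < rounds:
        done = True
        for i in range(n):
            for j in range(m):
                if counts[i, j] < explore_per_pair and t < rounds:
                    # Bandit feedback: observe only a noisy payoff sample.
                    reward = true_payoff[i, j] + rng.normal(0.0, 0.1)
                    sums[i, j] += reward
                    counts[i, j] += 1
                    t += 1
                    done = False
        if done:
            break
    est = sums / np.maximum(counts, 1)
    # Commit phase: row player plays the max-min (security) row of the
    # estimate; column player best-responds by minimizing within that row.
    row = int(np.argmax(est.min(axis=1)))
    col = int(np.argmin(est[row]))
    return row, col, est
```

For example, on the matrix `[[3, 1], [4, 0]]` the pure saddle point is the pair (row 0, column 1), and with a modest exploration budget the sketch recovers it from noisy samples.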
Computer Science > Machine Learning
arXiv:2506.14518 (cs)
[Submitted on 17 Jun 2025 (v1), last revised 19 Feb 2026 (this version, v4)]
Title: Two-Player Zero-Sum Games with Bandit Feedback
Authors: Elif Yılmaz, Christos Dimitrakakis
Abstract: We study a two-player zero-sum game in which the row player aims to maximize their payoff against a competing column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit (ETC) and action pair elimination frameworks. The first adapts ETC to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC and action pair elimination algorithms in the zero-sum game setting by focusing on learning pure-strategy Nash Equilibria. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. Particularly, after $T$ rounds, we achieve an instance-dependent regret upper bound of $O(\Delta + \sqrt{T})$ for ETC in the zero-sum game setting an...
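The $\varepsilon$-Nash Equilibrium property mentioned in the abstract has a simple form for pure strategies in a zero-sum game: an action pair is an $\varepsilon$-NE if neither player can improve their payoff by more than $\varepsilon$ through a unilateral deviation. A small sketch of that check (the function name and interface are assumptions, not from the paper; the elimination algorithms apply such a criterion to confidence bounds rather than the true matrix):

```python
import numpy as np

def is_eps_pure_nash(payoff, i, j, eps):
    """Check whether action pair (i, j) is an eps-Nash Equilibrium of the
    zero-sum game given by `payoff`, where the row player maximizes
    payoff[i, j] and the column player minimizes it.
    """
    # Best possible gain for the row player by deviating from row i.
    row_gain = payoff[:, j].max() - payoff[i, j]
    # Best possible gain for the column player by deviating from column j.
    col_gain = payoff[i, j] - payoff[i, :].min()
    return row_gain <= eps and col_gain <= eps
```

On `[[3, 1], [4, 0]]`, the pair (0, 1) passes the check even with $\varepsilon = 0$ (it is an exact saddle point), while (0, 0) only passes for sufficiently large $\varepsilon$.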