[2507.05306] Enjoying Non-linearity in Multinomial Logistic Bandits: A Minimax-Optimal Algorithm
Summary
This paper presents a minimax-optimal algorithm for the multinomial logistic bandit problem, enhancing existing regret guarantees by leveraging non-linearity in the logistic model.
Why It Matters
Understanding non-linearity in logistic models is crucial for improving decision-making in complex environments, such as reinforcement learning and recommendation systems. By exploiting the problem-dependent constant kappa_*, this work tightens regret guarantees, which is essential for developing sample-efficient algorithms in machine learning applications.
Key Takeaways
- The paper extends the analysis of non-linearity in logistic models to multinomial settings.
- Introduces a new algorithm whose regret bound improves on existing multinomial logistic bandit guarantees by exploiting the non-linearity of the model.
- Establishes a matching lower bound, demonstrating the optimality of the proposed algorithm.
- Highlights the importance of the problem-dependent constant kappa_* in performance guarantees.
- Provides insights applicable to reinforcement learning and recommender systems.
Statistics > Machine Learning
arXiv:2507.05306 (stat)
[Submitted on 7 Jul 2025 (v1), last revised 24 Feb 2026 (this version, v3)]
Title: Enjoying Non-linearity in Multinomial Logistic Bandits: A Minimax-Optimal Algorithm
Authors: Pierre Boudart (SIERRA), Pierre Gaillard (Thoth), Alessandro Rudi (PSL, DI-ENS, Inria)
Abstract: We consider the multinomial logistic bandit problem in which a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on understanding the impact of the non-linearity of the logistic model (Faury et al., 2020; Abeille et al., 2021). They introduced a problem-dependent constant $\kappa_* \geq 1$ that may be exponentially large in some problem parameters and which is captured by the derivative of the sigmoid function. It encapsulates the non-linearity and improves existing regret guarantees over $T$ rounds from $\smash{O(d\sqrt{T})}$ to $\smash{O(d\sqrt{T/\kappa_*})}$, where $d$ is the dimension of the parameter space. We extend their analysis to the multinomial logistic bandit framework with a finite action space, making it suitable for complex applications with more than two choices, such as reinforcement learning or recommender systems. To ach...
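The abstract notes that in the binary setting the constant $\kappa_* \geq 1$ is captured by the derivative of the sigmoid and can be exponentially large in the problem parameters. A minimal sketch of this phenomenon, assuming the common worst-case form kappa = max over arms of 1 / sigmoid'(x . theta) (the paper's exact definition of $\kappa_*$ may differ), together with the softmax probabilities that generalize the sigmoid to multinomial feedback:

```python
import numpy as np

def sigmoid(z):
    """Binary logistic link: probability of outcome 1 given score z."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multinomial logistic link: probabilities over K outcomes (stable shift)."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 3                                   # parameter dimension
theta = rng.normal(size=d)              # unknown parameter (hypothetical instance)
arms = rng.normal(size=(10, d))         # finite action set

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) is at most 1/4 and decays
# exponentially as |z| grows, so kappa blows up for large scores x . theta.
scores = arms @ theta
derivs = sigmoid(scores) * (1.0 - sigmoid(scores))
kappa = 1.0 / derivs.min()              # always >= 4, large when scores are extreme

# In the multinomial setting, each arm induces a softmax distribution over
# K possible outcomes instead of a single Bernoulli probability.
K = 4
Theta = rng.normal(size=(K, d))         # one parameter vector per outcome (hypothetical)
probs = softmax(Theta @ arms[0])
print(kappa, probs.sum())
```

Under this form, doubling the norm of theta roughly squares the worst-case kappa, which is the exponential blow-up the abstract alludes to; algorithms whose regret scales as $O(d\sqrt{T/\kappa_*})$ turn this large constant into an advantage.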