[2602.17831] The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
Summary
The Token Games introduces a novel evaluation framework for language models, using puzzle duels to assess reasoning capabilities without human-generated questions.
Why It Matters
As language models become more advanced, traditional evaluation methods may fail to accurately measure their reasoning abilities. The Token Games offers an innovative alternative in which models create puzzles and solve one another's, potentially leading to more reliable assessments and deeper insight into model capabilities.
Key Takeaways
- The Token Games framework allows language models to challenge each other with self-created puzzles.
- Elo ratings computed from pairwise duel outcomes are used to rank models against each other.
- The framework highlights the difficulty models face in creating quality puzzles, which is not captured by existing benchmarks.
- This approach opens new paradigms for evaluating reasoning and creativity in AI models.
- The resulting rankings align closely with existing benchmarks, supporting the framework's validity.
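The takeaways above mention Elo ratings derived from pairwise duels. A minimal sketch of the standard Elo update is shown below; the K-factor of 32 is a conventional choice and not taken from the paper:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one duel.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Model A beats model B in one duel, starting from equal ratings.
ra, rb = update(1500.0, 1500.0, 1.0)  # ra rises to 1516, rb falls to 1484
```

Iterating this update over many duels converges toward ratings whose differences predict win probabilities, which is what makes relative model comparison possible without any absolute scoring scale.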
Abstract
arXiv:2602.17831 (cs.AI), submitted on 19 Feb 2026. Authors: Simon Henniger, Gabriel Poesia.
Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that...
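The abstract describes the Programming Puzzles format: a puzzle is a Python function that returns a boolean, and a solution is any input that makes it return True. A minimal illustration follows; the toy puzzle and the names `sat` and `verify` are invented for this sketch, not taken from the paper:

```python
def sat(x: str) -> bool:
    """A toy puzzle: find a string that, repeated three times, spells 'ababab'."""
    return x * 3 == "ababab"

def verify(puzzle, candidate) -> bool:
    """A candidate solves a puzzle iff the puzzle function returns True on it."""
    try:
        return puzzle(candidate) is True
    except Exception:
        # A crashing candidate counts as a failed solution attempt.
        return False

verify(sat, "ab")  # a correct solution
verify(sat, "ba")  # an incorrect one
```

The appeal of this format for machine-vs-machine duels is that checking a solution is mechanical: the verifier simply executes the puzzle function, so no human judgment is needed to score either puzzle creation or puzzle solving.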