[2602.19160] Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Summary
This paper evaluates the reasoning capabilities of Large Language Models (LLMs) through General Game Playing tasks, revealing performance trends and common reasoning errors.
Why It Matters
Understanding the reasoning capabilities of LLMs is crucial for improving their application in complex decision-making environments. This research provides insights into their strengths and weaknesses, which can inform future model development and deployment in AI applications.
Key Takeaways
- LLMs demonstrate strong performance in structured reasoning tasks.
- Performance declines as game complexity and the evaluation horizon increase.
- Common reasoning errors include hallucinated rules and redundant facts.
- The study highlights the importance of linguistic semantics in game definitions.
- Insights can guide improvements in LLM training and application.
Computer Science > Artificial Intelligence
arXiv:2602.19160 (cs) [Submitted on 22 Feb 2026]
Title: Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Authors: Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk
Abstract: This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks, including next- and multi-step state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation obse...
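To make the abstract's forward-simulation tasks concrete, here is a minimal sketch, not taken from the paper, of the two task types on tic-tac-toe: given a state, enumerate the legal actions, and given a state and an action, produce the next state. In the study's setup, an LLM's predicted legal-move list or successor state could be checked against a reference simulator like this; all names here (`legal_actions`, `next_state`, the flattened board encoding) are illustrative assumptions.

```python
def legal_actions(board: str, player: str) -> list[int]:
    """Legal action generation: indices of empty cells on a 3x3 board
    flattened row-major into a 9-character string ('.' = empty)."""
    return [i for i, cell in enumerate(board) if cell == "."]

def next_state(board: str, player: str, action: int) -> str:
    """Next-state formulation: apply `player`'s move at `action`
    and return the resulting board string."""
    if board[action] != ".":
        raise ValueError("illegal move")
    return board[:action] + player + board[action + 1:]

board = "X.O......"  # X top-left, O top-right, rest empty
print(legal_actions(board, "O"))   # -> [1, 3, 4, 5, 6, 7, 8]
print(next_state(board, "O", 4))   # -> 'X.O.O....' (O takes the center)
```

A multi-step variant of the task simply chains `next_state` over a sequence of moves; the paper's observation that performance degrades with the evaluation horizon corresponds to LLMs drifting from the true state as such chains grow longer.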