[2602.19006] Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks
Summary
This paper evaluates 15 large language models from five providers on 20 quantum mechanics tasks, revealing clear performance stratification across capability tiers and task-dependent effects of tool augmentation.
Why It Matters
The study provides a systematic, reproducible benchmark for assessing language models' capabilities in quantum mechanics. Knowing where models excel (derivations), where they struggle (numerical computation), and how stable their results are across runs can guide both model development and the deployment of AI in scientific domains.
Key Takeaways
- Flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast (67%) models by 4pp and 14pp respectively.
- Derivations are the easiest tasks for models (92% average accuracy, 100% for flagship models), while numerical computation is the most challenging (42%).
- Tool augmentation on numerical tasks yields a modest average gain (+4.4pp) at 3x token cost, masking per-task effects that range from +29pp improvements to -16pp degradation.
- Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability (see the aggregation sketch after this list).
- The research provides a publicly available benchmark for future evaluations in quantum mechanics.
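To make the aggregation behind these takeaways concrete, here is a minimal sketch of how tier-average accuracy and run-to-run variability might be computed. The paper does not publish its scoring code here, so the model names, tier assignments, per-run accuracies, and the max-min spread metric below are all illustrative assumptions, not the paper's data or exact variance definition.

```python
# Minimal sketch of the aggregation described above. All model names, tier
# assignments, and per-run accuracies are illustrative placeholders; max-min
# spread is just one simple stability measure, not the paper's exact metric.
from statistics import mean

# Accuracy per run (fraction of the 20 tasks solved), three runs per model.
runs = {
    "flagship-a": [0.85, 0.80, 0.80],
    "flagship-b": [0.80, 0.85, 0.80],
    "mid-a":      [0.75, 0.80, 0.75],
    "fast-a":     [0.65, 0.70, 0.60],
}
tiers = {
    "flagship-a": "flagship", "flagship-b": "flagship",
    "mid-a": "mid-tier", "fast-a": "fast",
}

# Tier-average accuracy: average each model over its runs, then average
# the per-model scores within each capability tier.
tier_scores = {}
for model, accs in runs.items():
    tier_scores.setdefault(tiers[model], []).append(mean(accs))
for tier, scores in sorted(tier_scores.items()):
    print(f"{tier}: {100 * mean(scores):.1f}% average accuracy")

# Run-to-run variability per model, in percentage points (pp): a stable
# model shows a small spread across its three runs.
for model, accs in runs.items():
    spread_pp = 100 * (max(accs) - min(accs))
    print(f"{model}: {spread_pp:.1f}pp spread across {len(accs)} runs")
```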
arXiv:2602.19006 [cs.AI] (Submitted on 9 Nov 2025)
Title: Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks
Authors: S. K. Rithvik
Abstract
We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast models (67%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92% average, 100% for flagship models), while numerical computation remains most challenging (42%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity ranging from +29pp gains to -16pp degradation. Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability...
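The tool-augmentation finding, a small average gain hiding large per-task swings, is easy to illustrate in miniature. The sketch below uses invented task names and accuracies (not the paper's data) chosen to mimic the reported pattern, showing how a mean delta near +4pp can coexist with both a +29pp gain and a -16pp degradation on individual tasks.

```python
# Illustrative sketch: per-task tool-augmentation deltas (augmented minus
# baseline accuracy, in percentage points). Task names and values are
# invented to mimic the pattern in the abstract, not the paper's data.
baseline  = {"numeric-1": 0.40, "numeric-2": 0.55, "numeric-3": 0.35}
augmented = {"numeric-1": 0.69, "numeric-2": 0.39, "numeric-3": 0.35}

deltas_pp = {t: 100 * (augmented[t] - baseline[t]) for t in baseline}
mean_delta = sum(deltas_pp.values()) / len(deltas_pp)

print(f"average delta: {mean_delta:+.1f}pp")  # modest mean improvement...
for task, d in sorted(deltas_pp.items()):
    print(f"  {task}: {d:+.1f}pp")            # ...masking large per-task swings
```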