[2510.07172] NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
Summary
The paper introduces NewtonBench, a benchmark for evaluating large language models (LLMs) in scientific law discovery that addresses key methodological challenges in existing benchmarks.
Why It Matters
Automated discovery of scientific laws is a foundational challenge for AI-driven research. NewtonBench aims to provide a more faithful and scalable framework for assessing LLM capabilities, potentially guiding the development of more effective AI agents for scientific exploration.
Key Takeaways
- NewtonBench offers 324 tasks across 12 physics domains for LLM evaluation.
- The benchmark addresses the trade-off between scientific relevance, scalability, and memorization resistance.
- It emphasizes interactive model discovery over static function fitting.
- Findings reveal that LLMs struggle with complex systems and observational noise.
- Tool assistance can paradoxically hinder LLM performance by shifting focus from exploration to exploitation.
Computer Science > Artificial Intelligence
arXiv:2510.07172 (cs)
[Submitted on 8 Oct 2025 (v1), last revised 24 Feb 2026 (this version, v3)]
Title: NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See
Abstract: Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, an...
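To make the "counterfactual law shift" idea concrete, here is a minimal toy sketch. It is illustrative only: the shifted exponent, the oracle interface, and all function names are assumptions, not the benchmark's actual tasks or API. The sketch shows how an altered canonical law (here, a perturbed inverse-square exponent in Newtonian gravitation) defeats memorization, and how an agent could recover the shift by interactively probing the system rather than fitting a static dataset.

```python
import math
import random

G = 6.674e-11  # gravitational constant

def shifted_gravity(m1, m2, r, exponent=2.5):
    """Counterfactual variant of F = G*m1*m2/r**2 with a shifted exponent.

    An LLM cannot recall this law from training data; it must discover
    the exponent through experimentation.
    """
    return G * m1 * m2 / r ** exponent

def query_oracle(m1, m2, r, noise=0.0):
    """Interactive probe of the hidden system, with optional observational
    noise of the kind the benchmark's findings suggest LLMs struggle with."""
    f = shifted_gravity(m1, m2, r)
    return f * (1 + random.uniform(-noise, noise))

# Varying r while holding the masses fixed isolates the exponent:
# F(r1)/F(r2) = (r2/r1)**exponent, so two noiseless queries suffice.
f1 = query_oracle(1.0, 1.0, 1.0)
f2 = query_oracle(1.0, 1.0, 2.0)
estimated_exponent = math.log(f1 / f2) / math.log(2.0)
# → 2.5, recovering the counterfactual shift
```

The same probe-and-infer loop illustrates the paper's distinction between interactive model discovery and static function fitting: the agent chooses which inputs to query next based on what it has learned so far.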