[2410.05669] ACPBench: Reasoning about Action, Change, and Planning

Computer Science > Artificial Intelligence

arXiv:2410.05669 (cs)

[Submitted on 8 Oct 2024 (v1), last revised 27 Feb 2026 (this version, v3)]

Title: ACPBench: Reasoning about Action, Change, and Planning

Authors: Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

Abstract: There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no not...

Originally published on March 03, 2026. Curated by AI News.
