[2509.18776] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

arXiv - Machine Learning

Summary

The paper introduces AECBench, a benchmark for evaluating large language models (LLMs) in the Architecture, Engineering, and Construction (AEC) field, highlighting their strengths and limitations across a hierarchy of cognitive tasks, from knowledge memorization to application.

Why It Matters

As LLMs are increasingly integrated into the AEC sector, understanding their reliability and performance is crucial for ensuring safety and efficiency in engineering practices. AECBench provides a structured evaluation framework that can guide future developments in this area.

Key Takeaways

  • AECBench establishes a five-level cognitive evaluation framework for LLMs.
  • The benchmark includes 23 tasks derived from real AEC practices.
  • Performance declines were noted in complex reasoning and document generation tasks.
  • A dataset of 4,800 questions was created to assess LLM capabilities.
  • The study lays groundwork for future research on LLM integration in safety-critical fields.

Computer Science > Computation and Language

arXiv:2509.18776 (cs) [Submitted on 23 Sep 2025 (v1), last revised 14 Feb 2026 (this version, v3)]

Title: AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Authors: Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao

Abstract: Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark features a five-level, cognition-oriented evaluation framework (i.e., Knowledge Memorization, Understanding, Reasoning, Calculation, and Application). Based on the framework, 23 representative evaluation tasks were defined. These tasks were derived from authentic AEC practice, with scope ranging from codes retrieval to specializ...
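The abstract outlines the benchmark's architecture (five cognitive levels, 23 tasks, 4,800 questions) but no harness code appears in this summary. As a rough sketch only, the Python below shows one way such a hierarchical evaluation could be organized; every name in it (Task, Question, evaluate, and the model and score callables) is hypothetical and not taken from the paper.

    from dataclasses import dataclass, field

    # The five cognitive levels named in AECBench's framework.
    LEVELS = [
        "Knowledge Memorization",
        "Understanding",
        "Reasoning",
        "Calculation",
        "Application",
    ]

    @dataclass
    class Question:
        prompt: str
        reference_answer: str

    @dataclass
    class Task:
        name: str
        level: str  # one of LEVELS
        questions: list[Question] = field(default_factory=list)

    def evaluate(tasks, model, score):
        """Aggregate a model's mean score per cognitive level.

        `model(prompt) -> str` and `score(answer, reference) -> float`
        are placeholder callables standing in for an LLM client and a
        task-appropriate grader (exact match for retrieval tasks, a
        rubric for open-ended generation, and so on); the paper does
        not prescribe these names.
        """
        per_level = {lvl: [] for lvl in LEVELS}
        for task in tasks:
            for q in task.questions:
                answer = model(q.prompt)
                per_level[task.level].append(score(answer, q.reference_answer))
        # Drop levels with no questions to avoid division by zero.
        return {lvl: sum(s) / len(s) for lvl, s in per_level.items() if s}

Grouping scores by level rather than by task is what would let a harness like this surface the pattern the paper reports: solid performance on memorization and understanding, with declines on complex reasoning and document generation.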

