[2504.12764] GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks
Summary
GraphOmni introduces a benchmark framework for evaluating large language models on graph-theoretic tasks, highlighting performance variability and the need for tailored approaches.
Why It Matters
This work addresses the limitations of existing benchmarks for LLM graph reasoning by providing a comprehensive, extensible evaluation framework. Its findings show that performance depends on interacting factors such as graph type, serialization format, and prompting scheme, so models must be evaluated across these dimensions jointly; doing so can guide both future research and better configuration choices in real-world applications.
Key Takeaways
- GraphOmni offers a comprehensive evaluation framework for LLMs on graph tasks.
- Performance varies significantly based on serialization and prompting strategies.
- State-of-the-art models show room for improvement in graph reasoning.
- The framework encourages tailored approaches for open-source and closed-source models.
- A reinforcement learning-inspired method is proposed for optimal factor selection.
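The paper's summary does not spell out the reinforcement-learning-inspired selection method. As a rough illustration of the idea (not the authors' algorithm), the sketch below treats each (serialization format, prompting scheme) combination as an arm of an epsilon-greedy bandit and selects the combination with the best observed accuracy. All factor names and the simulated evaluator are hypothetical stand-ins for real LLM runs.

```python
import random
from itertools import product

# Hypothetical factor sets; GraphOmni's actual dimensions are richer and
# also include graph types and tasks.
SERIALIZATIONS = ["adjacency_list", "adjacency_matrix", "edge_list"]
PROMPT_SCHEMES = ["zero_shot", "few_shot", "chain_of_thought"]
ARMS = list(product(SERIALIZATIONS, PROMPT_SCHEMES))

def select_factors(rewards, counts, epsilon=0.1):
    """Epsilon-greedy choice of a (serialization, prompt-scheme) arm."""
    if random.random() < epsilon or not counts:
        return random.choice(ARMS)  # explore
    # Exploit: arm with the highest mean observed accuracy so far.
    return max(ARMS, key=lambda a: rewards.get(a, 0.0) / max(counts.get(a, 1), 1))

def update(rewards, counts, arm, accuracy):
    rewards[arm] = rewards.get(arm, 0.0) + accuracy
    counts[arm] = counts.get(arm, 0) + 1

# Toy loop: a simulated accuracy function stands in for evaluating an LLM
# on a batch of graph tasks under the chosen configuration.
random.seed(0)
rewards, counts = {}, {}

def simulated_accuracy(arm):
    base = {"adjacency_list": 0.6, "adjacency_matrix": 0.4, "edge_list": 0.5}[arm[0]]
    bonus = {"zero_shot": 0.0, "few_shot": 0.1, "chain_of_thought": 0.2}[arm[1]]
    return min(1.0, base + bonus + random.uniform(-0.05, 0.05))

for _ in range(500):
    arm = select_factors(rewards, counts)
    update(rewards, counts, arm, simulated_accuracy(arm))

best = max(ARMS, key=lambda a: rewards.get(a, 0.0) / max(counts.get(a, 1), 1))
print("best configuration:", best)
```

In a real benchmark run, `simulated_accuracy` would be replaced by an actual evaluation batch, making the search far cheaper than exhaustively sweeping every factor combination for every model.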
Computer Science > Machine Learning
arXiv:2504.12764 (cs)
[Submitted on 17 Apr 2025 (v1), last revised 22 Feb 2026 (this version, v4)]
Title: GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks
Authors: Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Zhengyuan Dong, Joao Monteiro, Bang Liu, Qiuzhuang Sun, Tianshu Yu
Abstract: This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we ...
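To make the abstract's "serialization formats" dimension concrete, the sketch below shows how one and the same graph can be textualized in three common formats before being embedded in a natural-language prompt. The function names and prompt wording are illustrative assumptions, not GraphOmni's actual serializers.

```python
# Serialize an undirected graph (node count n, edge list) into three
# hypothetical text formats of the kind a graph benchmark might use.
def to_edge_list(n, edges):
    pairs = ", ".join(f"({u}, {v})" for u, v in edges)
    return f"The graph has {n} nodes and edges: {pairs}."

def to_adjacency_list(n, edges):
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return "\n".join(f"Node {i}: {sorted(adj[i])}" for i in range(n))

def to_adjacency_matrix(n, edges):
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[u][v] = m[v][u] = 1
    return "\n".join(" ".join(map(str, row)) for row in m)

# The same 4-node graph rendered in one format and wrapped in a task prompt.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
prompt = (
    "Given the following graph:\n"
    + to_adjacency_list(4, edges)
    + "\nIs the graph connected?"
)
print(prompt)
```

Because each format exposes the same structure with very different surface statistics, an LLM's accuracy can shift when only the serializer changes, which is exactly the kind of interaction the benchmark is designed to measure.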