[2511.05722] OckBench: Measuring the Efficiency of LLM Reasoning
Summary
The paper introduces OckBench, a benchmark for measuring how efficiently large language models (LLMs) use tokens during reasoning, arguing that token efficiency deserves evaluation alongside accuracy.
Why It Matters
As LLMs become increasingly prevalent in various applications, understanding their efficiency in token usage is crucial for optimizing performance and reducing costs. OckBench addresses a gap in current evaluation methods, promoting a shift towards more efficient model design and deployment.
Key Takeaways
- OckBench is the first benchmark to assess both accuracy and token efficiency in LLMs.
- Current models exhibit significant variability in token usage, impacting operational costs.
- Optimizing token efficiency can lead to reduced latency and improved reasoning capabilities.
- The findings advocate for a paradigm shift in evaluating LLMs, emphasizing the importance of minimizing unnecessary token usage.
- Benchmarks are publicly available to encourage community engagement and improvement.
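The summary does not specify OckBench's exact scoring formula, but the idea of jointly measuring accuracy and token efficiency can be sketched with a minimal, hypothetical metric: accuracy, mean tokens per problem, and tokens spent per correct answer. The `Result` type and `evaluate` function below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Outcome of one benchmark problem: was the answer correct,
    and how many output tokens did the model spend?"""
    correct: bool
    tokens: int


def evaluate(results: list[Result]) -> tuple[float, float, float]:
    """Return (accuracy, mean tokens per problem, tokens per correct answer).

    Tokens-per-correct is one simple way to fold token cost into a
    quality score: a model that is equally accurate but twice as
    verbose scores twice as high (worse) on this axis.
    """
    n = len(results)
    num_correct = sum(r.correct for r in results)
    total_tokens = sum(r.tokens for r in results)
    accuracy = num_correct / n
    mean_tokens = total_tokens / n
    tokens_per_correct = total_tokens / num_correct if num_correct else float("inf")
    return accuracy, mean_tokens, tokens_per_correct


# Toy run: three correct answers out of four, 660 tokens in total.
results = [Result(True, 120), Result(True, 300), Result(False, 90), Result(True, 150)]
acc, mean_tok, tpc = evaluate(results)
# acc = 0.75, mean_tok = 165.0, tpc = 220.0
```

Under a metric like this, the paper's observation that equally accurate models can differ by 5.0× in token length would translate directly into a 5.0× gap in tokens-per-correct, and hence in serving cost.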
Computer Science > Computation and Language
arXiv:2511.05722 (cs)
[Submitted on 7 Nov 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: OckBench: Measuring the Efficiency of LLM Reasoning
Authors: Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu
Abstract: Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality while neglecting a critical dimension: efficiency of token usage. Token efficiency is highly variable in practice: models solving the same problem with similar accuracy can differ by up to 5.0× in token length, exposing a massive gap in reasoning efficiency. Such variance reveals significant redundancy and highlights the need for a standardized benchmark to quantify it. We therefore introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize a latent dimension of reasoning ability: token efficiency. Ultimately, w...