[2605.00300] Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
Computer Science > Artificial Intelligence
arXiv:2605.00300 (cs) [Submitted on 1 May 2026]

Title: Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu

Abstract: Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework's novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joule...
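The two cost composites can be read as resource-per-correct-answer ratios. A minimal Python sketch, assuming the straightforward normalization of per-query resource cost by endpoint accuracy (the abstract does not spell out the exact formulas, and all numbers below are illustrative):

```python
# Hypothetical sketch of TokenArena-style headline composites. The abstract
# does not give exact formulas; we assume the natural definition: resource
# spent per query divided by the probability of a correct answer.

def joules_per_correct(modeled_joules_per_query: float, accuracy: float) -> float:
    """Modeled energy cost amortized over correct answers (assumed definition)."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return modeled_joules_per_query / accuracy

def dollars_per_correct(price_per_query: float, accuracy: float) -> float:
    """Workload-blended price amortized over correct answers (assumed definition)."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return price_per_query / accuracy

# Illustrative comparison: the same model on two endpoints. A cheaper-per-query
# endpoint can still cost more per *correct* answer if its accuracy is lower.
endpoint_a = joules_per_correct(modeled_joules_per_query=40.0, accuracy=0.80)  # 50.0 J/correct
endpoint_b = joules_per_correct(modeled_joules_per_query=30.0, accuracy=0.55)  # ~54.5 J/correct
```

This framing makes the paper's endpoint-level comparison concrete: dividing by accuracy is what lets a slower-but-more-accurate endpoint win on the joules-per-correct and dollars-per-correct composites.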