[2502.01754] Evaluation of Large Language Models via Coupled Token Generation
Computer Science > Computation and Language
arXiv:2502.01754 (cs)
[Submitted on 3 Feb 2025 (v1), last revised 24 Mar 2026 (this version, v3)]

Title: Evaluation of Large Language Models via Coupled Token Generation
Authors: Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez

Abstract: State-of-the-art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an i...
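The abstract describes coupling the generation of different models through a shared source of randomness. The paper's exact causal construction is not reproduced here, but one common way to realize such a coupling is inverse-CDF sampling: each model maps the same uniform draw through its own next-token distribution. The sketch below is a minimal illustration under that assumption; the toy distributions, the `coupled_sample` helper, and the three-token vocabulary are all hypothetical.

```python
import bisect

def coupled_sample(probs, u):
    """Sample a token index by inverting the CDF of `probs` at a shared
    uniform draw u in [0, 1). Two models fed the same u are coupled:
    identical distributions always yield identical tokens."""
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)
    # First index whose cumulative mass reaches u (rescaled for safety).
    return bisect.bisect_left(cdf, u * total)

# Hypothetical next-token distributions of two models over one vocabulary.
model_a = [0.7, 0.2, 0.1]
model_b = [0.6, 0.3, 0.1]

u = 0.65  # one shared random draw used by both models at this step
token_a = coupled_sample(model_a, u)
token_b = coupled_sample(model_b, u)
print(token_a, token_b)
```

With the same `u`, any difference between the sampled tokens reflects a difference between the models' distributions rather than independent sampling noise, which is the intuition behind evaluating models under coupled rather than vanilla autoregressive generation.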