[2511.20836] Structured Prompts Improve Evaluation of Language Models
Computer Science > Computation and Language
arXiv:2511.20836 (cs)
[Submitted on 25 Nov 2025 (v1), last revised 1 Apr 2026 (this version, v3)]

Title: Structured Prompts Improve Evaluation of Language Models

Authors: Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari

Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks are essential for guiding deployment decisions. In practice, however, frameworks such as Holistic Evaluation of Language Models (HELM) typically evaluate models under a single static prompt configuration, even though model behavior depends strongly on prompt choice. As a result, reported scores can reflect the prompt choice as much as model capability. Declarative prompting frameworks such as DSPy offer a scalable way to evaluate models under a set of structured prompting strategies rather than a single static prompt. We present a reproducible DSPy+HELM framework for studying how prompt choice impacts reported benchmark outcomes. Using five prompting methods, we evaluate four frontier and two open-source LMs across seven benchmarks…
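
The abstract's point that DSPy lets the same task be run under several structured prompting strategies can be illustrated with a minimal sketch. This is not the paper's released DSPy+HELM harness; the signature, model name, and example question below are illustrative assumptions, and only two of the five prompting methods are shown.

```python
import dspy

# Configure the LM under evaluation (model name is a placeholder).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AnswerQuestion(dspy.Signature):
    """Answer a benchmark question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# Two structured prompting strategies over the same declarative task
# signature; the paper compares five such methods across benchmarks.
strategies = {
    "predict": dspy.Predict(AnswerQuestion),                  # direct answer
    "chain_of_thought": dspy.ChainOfThought(AnswerQuestion),  # reason, then answer
}

question = "Which organ is primarily affected in hepatitis?"  # illustrative item
for name, program in strategies.items():
    prediction = program(question=question)
    print(f"{name}: {prediction.answer}")
```

Because each strategy is a module applied to the same declarative signature, adding or swapping prompting methods does not require rewriting benchmark-specific prompt templates, which is what makes evaluating models under multiple prompt configurations scale.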