[2503.18825] EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Summary
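The paper presents methods for evaluating the economic decision-making capabilities and tendencies of LLMs: benchmarks derived from key problems in economics (procurement, scheduling, and pricing), and litmus tests that quantify an LLM's choice behavior on stylized tasks with conflicting objectives.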
Why It Matters
As LLMs are increasingly integrated into economic decision-making, understanding their capabilities and tendencies is crucial. This research provides foundational benchmarks and tests to evaluate LLM performance, offering insights into their reliability and decision-making processes.
Key Takeaways
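- Benchmarks derived from procurement, scheduling, and pricing problems, testing an LLM's ability to learn from the environment in context.
- Litmus tests that quantify an LLM's choice behavior on stylized tasks with multiple conflicting objectives, each reporting a litmus score, a reliability score, and a competency score.
- Investigation of changes in LLM capabilities and tendencies over time, across a broad array of frontier LLMs.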
- Validation of the litmus test framework for consistency and robustness.
- Foundation for future research on LLM integration in economic decision-making.
Computer Science > Artificial Intelligence
arXiv:2503.18825 (cs)
Submitted on 24 Mar 2025 (v1), last revised 18 Feb 2026 (this version, v4)
Title: EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Authors: Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski
Abstract: We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM's tradeoff response, a reliability score, which measures the coherence of an LLM's choice behavior, and a competency score, which measures an LLM's capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain-of-thou...