[2602.18481] AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models
Summary
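The paper introduces AlphaForgeBench, a framework for evaluating how well Large Language Models (LLMs) design trading strategies. It targets a failure mode that existing evaluations overlook: severe behavioral instability in LLM trading agents, which makes their measured trading performance unreliable.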
Why It Matters
As LLMs are increasingly applied in finance, reliable benchmarks are essential for evaluating their effectiveness in trading. AlphaForgeBench aims to improve the assessment of LLMs by focusing on financial reasoning and strategy formulation, which is crucial for developing robust trading systems.
Key Takeaways
- Many existing LLM trading benchmarks are unreliable because agent behavior is unstable: results vary sharply from run to run, even under deterministic decoding.
- AlphaForgeBench reframes LLMs as quantitative researchers, enhancing reproducibility.
- The framework separates reasoning from execution, improving evaluation methods.
- Experiments demonstrate that AlphaForgeBench reduces execution-induced instability.
- This approach aligns LLMs with real-world quantitative research workflows.
Quantitative Finance > Trading and Market Microstructure
arXiv:2602.18481 (q-fin)
[Submitted on 10 Feb 2026]
Title: AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models
Authors: Wentao Zhang, Mingxuan Zhao, Jincheng Gao, Jieshun You, Huaiyu Jia, Yilei Zhao, Bo An, Shuo Sun
Abstract: The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge tests to interactive trading simulations. However, current evaluations of real-time trading performance overlook a critical failure mode: severe behavioral instability in sequential decision-making under uncertainty. We empirically show that LLM-based trading agents exhibit extreme run-to-run variance, inconsistent action sequences even under deterministic decoding, and irrational action flipping across adjacent time steps. These issues stem from stateless autoregressive architectures lacking persistent action memory, as well as sensitivity to continuous-to-discrete action mappings in portfolio allocation. As a result, many existing financial trading benchmarks produce unreliable, non-reproducible, and uninformative evaluations. To address these limitations, we propose AlphaForgeBench, a principled framework that reframes LLMs as quantitative researchers rather than ...
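The instability symptoms named in the abstract (run-to-run variance, action flipping across adjacent time steps, and sensitivity to continuous-to-discrete action mappings) can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the function names, the `buy`/`sell`/`hold` action labels, the `band` parameter, and all sample numbers are assumptions chosen for the example, and the paper may define its metrics differently.

```python
from statistics import pstdev


def flip_rate(actions):
    """Fraction of adjacent steps where the action reverses (buy<->sell).

    Hypothetical metric: `actions` is a list of "buy" / "sell" / "hold" labels.
    """
    reversals = {("buy", "sell"), ("sell", "buy")}
    pairs = list(zip(actions, actions[1:]))
    if not pairs:
        return 0.0
    return sum(pair in reversals for pair in pairs) / len(pairs)


def run_to_run_std(final_returns):
    """Population standard deviation of final returns across repeated runs."""
    return pstdev(final_returns)


def to_action(target_weight, current_weight, band=0.0):
    """Map a continuous target portfolio weight to a discrete action.

    `band` is an assumed dead zone around the current weight; with band=0
    an arbitrarily small change in the target weight changes the action.
    """
    diff = target_weight - current_weight
    if diff > band:
        return "buy"
    if diff < -band:
        return "sell"
    return "hold"


# Made-up action sequence from one hypothetical run.
actions = ["buy", "sell", "buy", "hold", "hold"]
print(flip_rate(actions))

# Made-up final returns from two hypothetical runs of the same agent.
print(run_to_run_std([0.10, -0.10]))

# Two nearly identical target weights produce opposite discrete actions.
print(to_action(0.501, 0.5), to_action(0.499, 0.5))
```

The last line shows why a continuous-to-discrete mapping can amplify noise: a 0.002 difference in the target weight is enough to turn a buy into a sell. Widening `band` trades that sensitivity for slower reaction, which is a design choice of this sketch rather than anything the abstract prescribes.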