[2510.07231] EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Summary
EconCausal introduces a benchmark for evaluating causal reasoning in large language models, highlighting their limitations in context-dependent scenarios within the social sciences.
Why It Matters
Understanding causal relationships in socio-economic contexts is crucial for informed decision-making. This benchmark reveals significant gaps in current LLMs' capabilities, emphasizing the need for improved models in high-stakes environments where misinterpretation can have serious consequences.
Key Takeaways
- EconCausal benchmark includes 10,490 context-annotated causal triplets from empirical studies.
- Current LLMs show a sharp decline in accuracy when faced with context shifts and misinformation.
- Models struggle with recognizing null effects, achieving only 9.5% accuracy in ambiguous cases.
- The findings highlight risks in economic decision-making due to misinterpretation of causal relationships.
- The dataset and benchmark are publicly available for further research and development.
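To make the evaluation setup concrete, here is a minimal sketch of how context-annotated causal triplets and a direction-prediction accuracy metric could be represented. The record fields, class names, and toy examples below are illustrative assumptions, not the paper's actual schema or data.

```python
from dataclasses import dataclass

# Hypothetical record shape; field names are illustrative, not the paper's schema.
@dataclass
class CausalTriplet:
    cause: str
    effect: str
    context: str          # institutional/market context the effect depends on
    gold_direction: str   # e.g. "positive", "negative", or "null"

def accuracy(triplets, predictions):
    """Fraction of model predictions matching the gold causal direction."""
    correct = sum(p == t.gold_direction for t, p in zip(triplets, predictions))
    return correct / len(triplets)

# Toy illustration: the same intervention can have opposite effects
# depending on context, and some effects are genuinely null.
data = [
    CausalTriplet("minimum wage increase", "employment",
                  "competitive labor market", "negative"),
    CausalTriplet("minimum wage increase", "employment",
                  "monopsony labor market", "positive"),
    CausalTriplet("advertising ban", "sales",
                  "saturated market", "null"),
]
preds = ["negative", "negative", "positive"]  # a model that ignores context
print(round(accuracy(data, preds), 2))  # → 0.33
```

The second and third items mirror the failure modes the benchmark reports: a context-insensitive model repeats the "default" causal direction under a context shift and misses null effects entirely.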
Computer Science > Computation and Language
arXiv:2510.07231 (cs)
[Submitted on 8 Oct 2025 (v1), last revised 23 Feb 2026 (this version, v3)]
Title: EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Authors: Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim
Abstract: Socio-economic causal effects depend heavily on their specific institutional and environmental context. A single intervention can produce opposite results depending on regulatory or market factors, contexts that are often complex and only partially observed. This poses a significant challenge for large language models (LLMs) in decision-support roles: can they distinguish structural causal mechanisms from surface-level correlations when the context changes? To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals. Through a rigorous four-stage pipeline combining multi-run consensus, context refinement, and multi-critic filtering, we ensure each claim is grounded in peer-reviewed research with explicit identification strategies. Our evaluation reveals critical limitations in current LL...