[2602.17053] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Summary
The paper introduces RFEval, a benchmark for assessing reasoning faithfulness in large reasoning models, showing that models often produce unfaithful rationales even when their final answers are accurate.
Why It Matters
As AI systems increasingly influence decision-making, ensuring the reliability of their reasoning processes is critical. This research provides a framework to evaluate and improve the trustworthiness of large reasoning models, emphasizing that accuracy alone is insufficient for reliable AI.
Key Takeaways
- RFEval benchmarks reasoning faithfulness with 7,186 instances across seven tasks.
- 49.7% of outputs from evaluated models showed unfaithfulness, primarily due to stance inconsistency.
- Accuracy does not reliably indicate reasoning faithfulness, necessitating new evaluation methods.
- Failures concentrate in convergent domains such as math and code, and correlate more with post-training regimes than with model scale.
- Trustworthy AI requires optimizing both outcomes and the integrity of reasoning processes.
Computer Science > Artificial Intelligence
arXiv:2602.17053 (cs) [Submitted on 19 Feb 2026]
Title: RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Authors: Yunseok Han, Yejoon Lee, Jaeyoung Do
Abstract: Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of s...
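The two faithfulness conditions in the abstract can be sketched in code. The following is a minimal toy illustration, not the paper's actual evaluation harness: the `Trace` type, the last-token stance extractor, and the `answer_fn` stub are all hypothetical stand-ins for a real model and a real stance classifier. It shows the shape of the protocol: check that the reasoning's concluding stance matches the answer, then apply an output-level counterfactual edit to the stated conclusion and check whether the answer tracks the edit.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    reasoning: str  # the model's stated chain of thought
    answer: str     # the model's final answer


def stance_consistent(trace: Trace, extract_stance: Callable[[str], str]) -> bool:
    """Condition 1 (stance consistency): the stance concluded by the
    reasoning must agree with the final answer."""
    return extract_stance(trace.reasoning) == trace.answer


def causally_influential(answer_fn: Callable[[str], str],
                         trace: Trace,
                         counterfactual: str) -> bool:
    """Condition 2 (causal influence): intervene on the output-level
    reasoning by swapping the stated conclusion for a counterfactual one;
    if the reasoning causally drives the answer, the answer should follow
    the edited conclusion."""
    edited = trace.reasoning.replace(trace.answer, counterfactual)
    return answer_fn(edited) == counterfactual


def unfaithfulness_rate(traces: List[Trace],
                        extract_stance: Callable[[str], str],
                        answer_fn: Callable[[str], str],
                        counterfactual_of: Callable[[Trace], str]) -> float:
    """Fraction of traces failing either condition (cf. the 49.7% figure,
    which the paper computes with its own intervention protocol)."""
    bad = sum(
        not (stance_consistent(t, extract_stance)
             and causally_influential(answer_fn, t, counterfactual_of(t)))
        for t in traces
    )
    return bad / len(traces)


# Toy demo: a "model" that faithfully reads its answer off its reasoning,
# plus one deliberately stance-inconsistent trace.
extract = lambda r: r.split()[-1]   # toy stance extractor: last token
follow = lambda r: r.split()[-1]    # toy answer_fn that follows the reasoning
traces = [
    Trace("2 + 2 equals 4", "4"),   # faithful
    Trace("2 + 2 equals 4", "5"),   # stance-inconsistent, hence unfaithful
]
rate = unfaithfulness_rate(traces, extract, follow, lambda t: "9")
# rate == 0.5: one of the two traces fails the stance-consistency check
```

In this sketch the counterfactual is a simple string substitution; the paper's interventions are controlled and task-specific, but the decision logic (faithful only if both conditions hold) is the same.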