[2602.20710] Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Summary
The paper introduces Counterfactual Simulation Training (CST), a method designed to enhance Chain-of-Thought (CoT) faithfulness in large language models (LLMs) by rewarding CoTs that let a separate simulator accurately predict the model's outputs on counterfactual inputs.
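The core reward can be sketched in a few lines. This is a minimal illustration with hypothetical names (`cst_reward`, `toy_model`, `toy_simulator` are stand-ins, not the authors' code): the simulator sees only the CoT and a counterfactual input, and the reward is how often its prediction matches the model's actual output on that counterfactual.

```python
# Hedged sketch of the CST reward (hypothetical names, not the paper's code).
# A CoT is "faithful" to the extent that a simulator reading it can predict
# the model's behavior on counterfactual inputs it has not seen answered.

def cst_reward(model, simulator, x, counterfactuals):
    cot, _ = model(x)                      # CoT produced on the original input
    correct = 0
    for x_cf in counterfactuals:
        _, actual = model(x_cf)            # model's true output on x'
        predicted = simulator(cot, x_cf)   # simulator's guess from the CoT alone
        correct += int(predicted == actual)
    return correct / len(counterfactuals)

# Toy stand-ins: a "model" that answers by input parity, and a simulator that
# can read the stated rule out of the CoT.
def toy_model(x):
    answer = "even" if x % 2 == 0 else "odd"
    return f"I check the parity of {x}.", answer

def toy_simulator(cot, x_cf):
    if "parity" in cot:                    # CoT reveals the rule -> simulable
        return "even" if x_cf % 2 == 0 else "odd"
    return "unknown"
```

Here a CoT that states the real decision rule earns full reward, while one that hides it would leave the simulator guessing.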
Why It Matters
Improving CoT faithfulness is crucial for understanding LLM outputs and ensuring their reliability. CST offers a novel approach that could lead to more accurate and generalizable reasoning in AI systems, addressing a key limitation of current practice: CoTs often fail to reflect the factors that actually drive a model's answer.
Key Takeaways
- CST improves CoT monitor accuracy on cue-based counterfactuals by 35 points.
- The method is more efficient than traditional reinforcement learning approaches.
- Larger models benefit more from CST, indicating its potential for scalability.
- CST outperforms prompting baselines in enhancing model outputs.
- Faithfulness improvements do not generalize to dissuading cues.
Abstract
arXiv:2602.20710 [cs.AI], submitted 24 Feb 2026. Authors: Peter Hase, Christopher Potts.
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more eff...
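The first setting, cue-based counterfactual monitoring, can be illustrated with a small probe. This is a hedged sketch under assumed names (`cue_reliance`, `toy_model`, `add_opinion_cue` are hypothetical): insert a spurious cue, such as a stated user opinion, into an otherwise identical input; if the model's answer flips, the model relied on the cue, and a faithful CoT should let a monitor predict that flip.

```python
# Hedged sketch of cue-based counterfactual probing (hypothetical names).
# If adding a task-irrelevant cue changes the model's answer, the answer
# depends on the cue; a faithful CoT should make this dependence visible.

def cue_reliance(model, x, add_cue):
    _, base_answer = model(x)
    _, cued_answer = model(add_cue(x))
    return base_answer != cued_answer      # True -> output depends on the cue

def toy_model(prompt):
    # Sycophancy-style toy: defers to a stated user opinion when present.
    if "I think the answer is B" in prompt:
        return "(reasoning)", "B"
    return "(reasoning)", "A"

def add_opinion_cue(prompt):
    return prompt + " I think the answer is B."
```

A monitor trained under CST would be rewarded for predicting, from the CoT alone, that this model's answer changes on the cued counterfactual.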