[2602.14404] Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
Summary
This study examines the efficacy of reasoning traces in neural networks, introducing PITA, a large-scale dataset of propositional-logic statements and proofs, to assess how well models trained on short proofs generalize to tasks of greater depth and breadth.
Why It Matters
Understanding the strengths and limitations of reasoning traces in AI models is crucial for improving their performance in complex tasks. This research provides insights into how task topology affects generalization, which can inform future developments in AI reasoning capabilities.
Key Takeaways
- Introduces PITA, a dataset with over 23 million propositional logic statements.
- Finds that reasoning trace models perform well on broad, shallow tasks but struggle with narrow, deep tasks.
- Proposes two metrics for evaluating task complexity: task depth (the number of steps required to solve an example) and task breadth (the number of unique examples across a task).
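The two metrics above can be sketched concretely. The snippet below is an illustrative interpretation, not code from the paper: it treats a task as a collection of examples, takes the maximum proof length as a proxy for task depth, and counts unique statements for task breadth. The `Example` type and the toy statements are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    statement: str     # a propositional-logic statement
    proof_steps: int   # number of steps in its proof


def task_depth(task):
    """Task depth: steps needed to solve an example (here: the max proof length)."""
    return max(ex.proof_steps for ex in task)


def task_breadth(task):
    """Task breadth: number of unique examples in the task."""
    return len({ex.statement for ex in task})


toy_task = [
    Example("p -> p", 1),
    Example("(p & q) -> p", 2),
    Example("((p -> q) & p) -> q", 3),
]
print(task_depth(toy_task), task_breadth(toy_task))  # 3 3
```

Under this reading, a "baguette" task is narrow and deep (few unique examples, long proofs) while a "boule" task is broad and shallow.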
Computer Science > Artificial Intelligence
arXiv:2602.14404 (cs)
[Submitted on 16 Feb 2026]
Title: Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
Authors: William L. Tong, Ege Cakar, Cengiz Pehlevan
Abstract: Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite this rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remains incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine the truth or falsity of statements with proofs up to a fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deterio...
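The length-generalization protocol the abstract describes amounts to splitting a dataset by proof length: train on statements whose proofs are at most some cutoff, then evaluate on strictly longer proofs. A minimal sketch, assuming each example is a dict with a `proof_len` field (the field name and toy data are illustrative, not from PITA):

```python
def length_generalization_split(examples, max_train_len):
    """Split examples by proof length: train on short proofs, test on longer ones."""
    train = [ex for ex in examples if ex["proof_len"] <= max_train_len]
    test = [ex for ex in examples if ex["proof_len"] > max_train_len]
    return train, test


# Toy examples with proof lengths 1 through 7.
examples = [{"stmt": f"s{i}", "proof_len": i} for i in range(1, 8)]
train, test = length_generalization_split(examples, max_train_len=4)
print(len(train), len(test))  # 4 3
```

A model that performs well on `test` after seeing only `train` is said to length-generalize; the paper's finding is that this succeeds on broad, shallow task subsets.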