[2510.19753] Transformers Provably Learn Algorithmic Solutions for Graph Connectivity, But Only with the Right Data
Summary
The paper shows, both theoretically and empirically, that Transformers can learn an algorithmic solution to graph connectivity, but only when most training instances fall within the model's representational capacity.
Why It Matters
Understanding when Transformers learn genuine algorithms rather than brittle shortcuts is central to building reliable models. This work shows that data selection, specifically matching training instances to model capacity, determines which solution the model converges to, which has direct implications for training models on other computational tasks.
Key Takeaways
- Transformers can learn algorithmic solutions, but without the right data they fall back on brittle heuristics.
- The learning success hinges on training data being within the model's capacity.
- Restricting training data to within-capacity graphs (diameter at most 3^L for an L-layer model) leads the model to learn the exact algorithm.
- The study provides theoretical and empirical insights into model training dynamics.
- Understanding these dynamics can improve the design of AI systems for complex tasks.
Computer Science > Machine Learning
arXiv:2510.19753 (cs)
[Submitted on 22 Oct 2025 (v1), last revised 17 Feb 2026 (this version, v2)]
Title: Transformers Provably Learn Algorithmic Solutions for Graph Connectivity, But Only with the Right Data
Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan
Abstract: Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the Disentangled Transformer, and prove that an $L$-layer model can compute connectivity in graphs with diameters up to $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. By analyzing training dynamics, we prove that whether the model learns this strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of the algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically show that restricting training data to stay within a model's capacity makes both standard and Disentangled Transformers learn the exact algorithm.
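The adjacency-matrix algorithm from the abstract can be illustrated directly: adding self-loops and cubing the matrix extends reachability from distance $d$ to $3d$, so $L$ cubing steps cover diameters up to $3^L$. The following is a minimal NumPy sketch of that arithmetic, not the paper's Disentangled Transformer construction; the function name and graph are illustrative.

```python
import numpy as np

def connectivity_matrix(adj: np.ndarray, num_layers: int) -> np.ndarray:
    """Reachability by repeatedly cubing the adjacency matrix.

    With self-loops added, one cubing step extends reachability from
    distance d to 3*d, so num_layers steps cover all paths of length
    up to 3**num_layers -- mirroring the claim that an L-layer model
    decides connectivity for graphs of diameter up to 3^L.
    """
    R = adj.astype(np.int64)
    np.fill_diagonal(R, 1)                       # self-loops allow shorter walks
    for _ in range(num_layers):
        R = ((R @ R @ R) > 0).astype(np.int64)   # one "layer" = one cubing step
    return R > 0

# Path graph 0-1-2-3 has diameter 3, so one layer (3**1 = 3) suffices.
A = np.zeros((4, 4), dtype=bool)
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = True
R = connectivity_matrix(A, num_layers=1)
print(R[0, 3])  # True: nodes 0 and 3 are connected
```

With zero layers the matrix only records direct edges; each additional layer triples the reachable distance, which is why capacity grows exponentially in depth.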