[2602.18807] Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning
Summary
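This study evaluates GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course, in a staggered-access study with 148 students. It compares conversational feedback (a chatbot for math questions) with embedded feedback (a structured proof-review tool) for mathematical proof learning.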
Why It Matters
The findings highlight the limitations of chat-based support in educational settings, suggesting that structured feedback may be more effective for learning outcomes. This has implications for the design of AI tutoring systems and their integration into curricula, especially in mathematics.
Key Takeaways
- Chatbot-based support alone may not enhance independent assessment performance in math proof learning.
- Structured proof-review tools may provide more reliable feedback than conversational chatbots.
- Students with lower self-efficacy and lower prior exam performance used both support tools more frequently, indicating a need for tailored interventions.
Computer Science > Human-Computer Interaction
arXiv:2602.18807 (cs)
[Submitted on 21 Feb 2026]
Title: Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning
Authors: Eason Chen, Sophia Judicke, Kayla Beigh, Xinyi Tang, Isabel Wang, Nina Yuan, Zimo Xiao, Chuangji Li, Shizhuo Li, Reed Luttmer, Shreya Singh, Maria Yampolsky, Naman Parikh, Yvonne Zhao, Meiyi Chen, Scarlett Huang, Anishka Mohanty, Gregory Johnson, John Mackey, Jionghao Lin, Ken Koedinger
Abstract: We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, while we did not observe this performance increase transfer to exam scores. Usage logs show that students with lower self-efficacy and prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled using an automat...