[2603.03512] Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks
Computer Science > Computers and Society
arXiv:2603.03512 (cs)
[Submitted on 3 Mar 2026]

Title: Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks
Authors: Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

Abstract: Teachers face increasing demands on their time, particularly in adapting mathematics curricula to meet individual student needs while maintaining cognitive rigor. This study evaluates whether AI tools can accurately classify the cognitive demand of mathematical tasks, an ability that matters for creating or adapting tasks that support student learning. We tested eleven AI tools, six general-purpose (ChatGPT, Claude, DeepSeek, Gemini, Grok, Perplexity) and five education-specific (Brisk, Coteach AI, Khanmigo, Magic School, this http URL), on their ability to categorize mathematics tasks across four levels of cognitive demand using a research-based framework. The goal was to approximate the performance teachers would achieve with straightforward prompts. On average, the AI tools classified cognitive demand correctly in only 63% of cases. Education-specific tools were no more accurate than general-purpose tools, and no tool exceeded 83% accuracy. All tools struggled with tasks at the extremes of cognitive demand (Memorization and Doing Mathematics) ...
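For concreteness, the sketch below shows one way the headline accuracy figures (63% average, 83% maximum) could be computed: compare each tool's level assignments against expert labels and take the fraction of matches. This is not code from the paper; it assumes the four levels are those of the Smith and Stein cognitive-demand framework (the abstract names only the two extremes), and all labels and tool outputs in the example are hypothetical.

    # Minimal sketch: per-tool classification accuracy against expert labels.
    # Level names assume the Smith & Stein framework; the abstract confirms
    # only the extremes (Memorization, Doing Mathematics). Data is made up.

    LEVELS = [
        "Memorization",
        "Procedures without Connections",
        "Procedures with Connections",
        "Doing Mathematics",
    ]

    def accuracy(predicted, expert):
        """Fraction of tasks where the tool's level matches the expert label."""
        assert len(predicted) == len(expert)
        hits = sum(p == e for p, e in zip(predicted, expert))
        return hits / len(expert)

    # Hypothetical example: one tool's classifications vs. expert coding.
    expert_labels = ["Memorization", "Doing Mathematics",
                     "Procedures with Connections"]
    tool_labels = ["Procedures without Connections", "Doing Mathematics",
                   "Procedures with Connections"]
    print(f"accuracy = {accuracy(tool_labels, expert_labels):.2f}")  # 0.67

Averaging this score over the tasks for each of the eleven tools, then across tools, would yield the kind of summary statistic the abstract reports.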