[2602.22072] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Summary
This article examines the robustness of Theory of Mind (ToM) in large language models (LLMs) using perturbed false-belief tasks, revealing steep performance drops and the nuanced effects of Chain-of-Thought prompting.
Why It Matters
Understanding the limitations of LLMs in exhibiting Theory of Mind capabilities is crucial for advancing AI development. This study highlights the need for careful evaluation of AI reasoning processes, especially in tasks requiring complex understanding of others' mental states.
Key Takeaways
- LLMs show a steep decline in ToM capabilities when faced with task perturbations.
- Chain-of-Thought prompting can enhance ToM performance but may degrade accuracy in certain scenarios.
- A new annotated ToM dataset was introduced to facilitate further research in this area.
- The study questions whether LLMs possess any robust form of ToM, suggesting that prompting techniques such as Chain-of-Thought should be applied selectively.
- Metrics for evaluating reasoning chain correctness were proposed, aiding future AI assessments.
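The summary does not specify how the paper's reasoning-chain metrics are defined. As a loose illustration only, a "chain correctness" score might compare a model's generated reasoning steps against an annotated space of valid chains, e.g. by the longest matching prefix. The step labels and the function below are hypothetical, not taken from the paper.

```python
# Illustrative sketch only: scores a generated reasoning chain by the
# fraction of its leading steps that match some annotated valid chain.
# Step names are invented for this toy example.

def chain_correctness(generated, valid_chains):
    """Return best longest-matching-prefix length over valid chains,
    normalized by the length of the generated chain."""
    if not generated:
        return 0.0
    best = 0
    for valid in valid_chains:
        matched = 0
        for g, v in zip(generated, valid):
            if g != v:
                break
            matched += 1
        best = max(best, matched)
    return best / len(generated)

# Toy false-belief scenario (Sally-Anne-style), hypothetical labels:
valid = [
    ["sally_places_marble", "sally_leaves",
     "anne_moves_marble", "sally_believes_original_location"],
]
generated = ["sally_places_marble", "sally_leaves",
             "sally_believes_new_location"]
print(chain_correctness(generated, valid))  # 2 of 3 leading steps match
```

A faithfulness metric could then separately check whether the final answer follows from the generated chain, rather than from the chain's correctness.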
Computer Science > Computation and Language — arXiv:2602.22072 (cs)
[Submitted on 25 Feb 2026]
Title: Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Authors: Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek
Abstract: Theory of Mind (ToM) refers to an agent's ability to model the internal states of others. Contributing to the debate whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought prompting (CoT) to enhance performance and explain the LLM's decision. We introduce a handcrafted, richly annotated ToM dataset, including classic and perturbed false-belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, and task solutions, and propose metrics to evaluate reasoning chain correctness and to what extent final answers are faithful to reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion of any robust form of ToM being present. While CoT prompting improves the ToM performance overall in a faithful manner, it surprisingly de...