[2602.15460] On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
Summary
This paper evaluates the out-of-distribution (OOD) generalization of reasoning in multimodal large language models (LLMs) through a grid-based navigation task, revealing limited OOD performance despite improvements in in-distribution generalization.
Why It Matters
Understanding the generalization capabilities of multimodal LLMs is crucial for advancing AI applications in real-world scenarios. This research highlights the challenges in applying reasoning models to unseen data, which is vital for developing robust AI systems that can adapt to new environments and tasks.
Key Takeaways
- Chain-of-thought (CoT) reasoning enhances in-distribution generalization.
- Out-of-distribution generalization remains limited, particularly with larger maps.
- Combining multiple text formats yields better OOD generalization.
- Text-based models outperform image-based models in this context.
- A new evaluation framework for multimodal reasoning is proposed.
Computer Science > Machine Learning
arXiv:2602.15460 (cs) [Submitted on 17 Feb 2026]

Title: On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
Authors: Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce

Abstract: Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-di...
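To make the task concrete, the planning problem from the abstract can be sketched as a shortest-path search over a grid: given a map with obstacles, a start, and a goal, produce the move sequence. This is an illustrative sketch only; the grid encoding (`"#"` for obstacles), the move names, and the solver choice (BFS) are assumptions for illustration, not the paper's exact task format.

```python
from collections import deque

# Moves a player can take, as (row-delta, col-delta) offsets.
# The move vocabulary here is an assumption, not the paper's exact format.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def plan_moves(grid, start, goal):
    """Breadth-first search returning a shortest move sequence, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None  # goal unreachable

# A 3x4 map: "#" marks an obstacle, "." a free cell.
grid = ["....",
        ".#..",
        "...."]
print(plan_moves(grid, start=(0, 0), goal=(2, 3)))
```

A textual map like this is also what the paper's text-based input representations would encode, in contrast to rendering the same grid as an image for the visual variants.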