[2602.15460] On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

arXiv - Machine Learning · 4 min read · Article

Summary

This paper evaluates the out-of-distribution (OOD) generalization of reasoning in multimodal large language models (LLMs) on a grid-based navigation task, finding that OOD performance remains limited even when chain-of-thought reasoning improves in-distribution generalization.

Why It Matters

Understanding the generalization capabilities of multimodal LLMs is crucial for advancing AI applications in real-world scenarios. This research highlights the challenges in applying reasoning models to unseen data, which is vital for developing robust AI systems that can adapt to new environments and tasks.

Key Takeaways

  • Chain-of-thought (CoT) reasoning enhances in-distribution generalization.
  • Out-of-distribution generalization remains limited, particularly with larger maps.
  • Combining multiple text formats yields better OOD generalization.
  • Text-based models outperform image-based models in this context.
  • A new evaluation framework for multimodal reasoning is proposed.

Computer Science > Machine Learning — arXiv:2602.15460 (cs) [Submitted on 17 Feb 2026]

Title: On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

Authors: Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce

Abstract: Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-di...
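The navigation task described in the abstract can be sketched classically: a breadth-first search over the grid yields a shortest obstacle-avoiding move sequence, which is the kind of ground-truth plan such a benchmark can check model outputs against. The grid encoding (`S`, `G`, `#`) and move names below are illustrative assumptions, not the paper's actual data format.

```python
from collections import deque

# Hypothetical grid encoding: 'S' start, 'G' goal, '#' obstacle, '.' free cell.
GRID = [
    "S..#",
    ".#..",
    "...G",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def find(grid, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == ch:
                return (r, c)
    raise ValueError(f"{ch!r} not found in grid")

def solve(grid):
    """Breadth-first search for a shortest obstacle-avoiding move sequence."""
    start, goal = find(grid, "S"), find(grid, "G")
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None  # goal unreachable

print(solve(GRID))  # a shortest 5-move plan, e.g. down/down/right/right/right
```

Scaling the map size or obstacle density at test time, as the paper does, only requires changing `GRID`, which is what makes the task convenient for building controlled OOD splits.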

Related Articles

  • [2603.23966] Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage (arXiv - AI · 4 min)
  • [2603.16790] InCoder-32B: Code Foundation Model for Industrial Scenarios (arXiv - AI · 4 min)
  • [2603.16430] EngGPT2: Sovereign, Efficient and Open Intelligence (arXiv - AI · 4 min)
  • [2603.11066] Exploring Collatz Dynamics with Human-LLM Collaboration (arXiv - AI · 4 min)
