[2603.27958] CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

[2603.27958] CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

arXiv - AI 3 min read

About this article

Abstract page for arXiv paper 2603.27958: CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Computer Science > Artificial Intelligence arXiv:2603.27958 (cs) [Submitted on 30 Mar 2026] Title:CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs Authors:Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin View a PDF of the paper titled CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs, by Yongkang Du and 3 other authors View PDF HTML (experimental) Abstract:Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation on the state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieving only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task. Subjects: Artifi...

Originally published on March 31, 2026. Curated by AI News.

Related Articles

Llms

OpenClaw security checklist: practical safeguards for AI agents

Here is one of the better quality guides on the ensuring safety when deploying OpenClaw: https://chatgptguide.ai/openclaw-security-checkl...

Reddit - Artificial Intelligence · 1 min ·
I let Gemini in Google Maps plan my day and it went surprisingly well | The Verge
Llms

I let Gemini in Google Maps plan my day and it went surprisingly well | The Verge

Gemini in Google Maps is a surprisingly useful way to explore new territory.

The Verge - AI · 11 min ·
Llms

The person who replaces you probably won't be AI. It'll be someone from the next department over who learned to use it - opinion/discussion

I'm a strategy person by background. Two years ago I'd write a recommendation and hand it to a product team. Now.. I describe what I want...

Reddit - Artificial Intelligence · 1 min ·
Block Resets Management With AI As Cash App Adds Installment Transfers
Llms

Block Resets Management With AI As Cash App Adds Installment Transfers

Block (NYSE:XYZ) plans a permanent organizational overhaul that replaces many middle management roles with AI-driven models to create fla...

AI Tools & Products · 5 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime