[2603.27958] CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

arXiv - AI · March 31, 2026 · 3 min read

About this article

Computer Science > Artificial Intelligence

arXiv:2603.27958 (cs)

[Submitted on 30 Mar 2026]

Title: CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Authors: Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin

Abstract: Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation of state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: difficulty in (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.

Subjects: Artifi...
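To make the idea of "extracting a rule from each pair and composing new transformations" concrete, here is a minimal toy sketch in Python. It is not the CARV task format or the authors' method: the attribute dictionaries, the function names (`extract_rule`, `compose_rules`, `apply_rule`), and the choice to compose rules by merging attribute changes are all assumptions made purely for illustration. The actual benchmark works on images, not symbolic dictionaries.

```python
# Toy illustration only: objects are hypothetical attribute dictionaries,
# and a "rule" is the set of attributes that change within a pair.

def extract_rule(before: dict, after: dict) -> dict:
    """Return the attributes whose values differ between before and after."""
    return {key: after[key] for key in after if before.get(key) != after[key]}


def compose_rules(*rules: dict) -> dict:
    """Merge several rules into one; reject rules that contradict each other."""
    combined: dict = {}
    for rule in rules:
        for key, value in rule.items():
            if key in combined and combined[key] != value:
                raise ValueError(f"conflicting rules for attribute {key!r}")
            combined[key] = value
    return combined


def apply_rule(obj: dict, rule: dict) -> dict:
    """Apply a rule to an object, returning a new object."""
    return {**obj, **rule}


# Pair 1 changes only the color; pair 2 changes only the size.
pair_1 = ({"shape": "circle", "color": "red", "size": "small"},
          {"shape": "circle", "color": "blue", "size": "small"})
pair_2 = ({"shape": "square", "color": "green", "size": "small"},
          {"shape": "square", "color": "green", "size": "large"})

composed = compose_rules(extract_rule(*pair_1), extract_rule(*pair_2))

# The composed rule is applied to a new query object.
query = {"shape": "triangle", "color": "red", "size": "small"}
answer = apply_rule(query, composed)
print(answer)
```

In this toy setting the composed rule recolors the query to blue and enlarges it while leaving its shape untouched. The first failure mode named in the abstract corresponds roughly to the `extract_rule` step, which here is trivial because the attributes are already symbolic.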

Originally published on March 31, 2026. Curated by AI News.
