[2603.23311] ARGENT: Adaptive Hierarchical Image-Text Representations
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.23311 (cs) [Submitted on 24 Mar 2026]

Title: ARGENT: Adaptive Hierarchical Image-Text Representations
Authors: Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar

Abstract: Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Moreover, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics that are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, ...
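The cone-collapse mechanism described in the abstract can be illustrated with the standard hyperbolic entailment-cone half-aperture formula (Ganea et al., 2018), psi(x) = arcsin(K (1 - ||x||^2) / ||x||), where x is a point in the Poincaré ball and K is a cone-width constant. This is a minimal sketch under that assumed formulation, not ARGENT's actual loss; the value K = 0.1 is an illustrative choice.

```python
import math

def half_aperture(norm: float, K: float = 0.1) -> float:
    """Half-aperture (radians) of an entailment cone rooted at a point
    with the given Euclidean norm in the Poincare ball, following the
    Ganea et al. formulation: psi(x) = arcsin(K * (1 - ||x||^2) / ||x||).
    When the argument exceeds 1 the cone has widened into a half-space,
    so we clip to pi/2 -- the degenerate regime the abstract calls
    'cone collapse'. K = 0.1 is an illustrative constant, not a value
    taken from the paper."""
    arg = K * (1.0 - norm**2) / norm
    return math.asin(min(arg, 1.0))

# As a parent embedding contracts toward the origin, its cone widens
# monotonically until it covers a half-space:
for n in (0.9, 0.5, 0.2, 0.05):
    print(f"||x|| = {n:.2f}  ->  aperture = {half_aperture(n):.3f} rad")
```

Running the loop shows the aperture growing as the norm shrinks, reaching the clipped value of pi/2 for sufficiently small norms; this is why an unconstrained entailment loss that pulls parents toward the origin destroys the hierarchy, and why the paper pairs its adaptive loss with a norm regularizer.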