[2510.08919] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.08919 (cs)
[Submitted on 10 Oct 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Authors: Daiki Yoshikawa, Takashi Matsubara

Abstract: Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-pro...
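
To make the $\ell_1$-product metric named in the abstract concrete, here is a minimal sketch of the standard construction: an embedding is split into equal-sized factors, a hyperbolic distance is computed within each factor, and the per-factor distances are summed, i.e. $d(x, y) = \sum_i d_{\mathbb{H}}(x_i, y_i)$. The abstract does not say which hyperbolic model, curvature, or factor dimension PHyCLIP uses, so the Poincaré ball with curvature $-1$, the function names `poincare_distance` and `l1_product_distance`, and the parameter `factor_dim` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the Poincare ball.

    Assumption: unit ball with curvature -1; the paper may use a
    different hyperbolic model or curvature.
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - sum(a * a for a in x)) * (1.0 - sum(b * b for b in y))
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def l1_product_distance(x, y, factor_dim):
    """l1-product metric: sum of per-factor hyperbolic distances.

    x and y are flat coordinate lists, split into consecutive chunks of
    length factor_dim (a hypothetical parameter). Each chunk must lie
    strictly inside the unit ball of its own factor.
    """
    if len(x) != len(y) or len(x) % factor_dim != 0:
        raise ValueError("inputs must have equal length divisible by factor_dim")
    total = 0.0
    for start in range(0, len(x), factor_dim):
        stop = start + factor_dim
        total += poincare_distance(x[start:stop], y[start:stop])
    return total


# Two 2-dimensional factors; y moves away from the origin in both of them.
origin = [0.0, 0.0, 0.0, 0.0]
y = [0.5, 0.0, 0.5, 0.0]
distance = l1_product_distance(origin, y, factor_dim=2)
```

In this example each factor contributes $2\,\mathrm{artanh}(0.5) = \ln 3$, so `distance` equals $2 \ln 3$. Because the factor distances add rather than combine through a square root, a displacement in one factor is not discounted by displacements in the others, which is one way to read the abstract's claim that cross-family composition is handled at the product level while hierarchy lives inside each factor.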