[2603.26798] Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
Computer Science > Machine Learning

arXiv:2603.26798 (cs) [Submitted on 26 Mar 2026]

Title: Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

Authors: Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Abstract: Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate targe...
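The first step of the framework, extracting a binary hierarchy by agglomerative clustering of class centroids, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names are placeholders, and the centroids are random stand-ins for what would in practice be averaged, unit-normalized CLIP embeddings per class.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Stand-in class centroids (in practice: mean VLM embedding per child class).
rng = np.random.default_rng(0)
class_names = ["cat", "dog", "car", "truck"]
centroids = rng.normal(size=(len(class_names), 8))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # unit-norm, CLIP-style

# Agglomerative clustering produces a binary merge tree over the classes;
# cosine distance matches the angular geometry of the embedding space.
Z = linkage(centroids, method="average", metric="cosine")
root = to_tree(Z)

def print_tree(node, depth=0):
    """Print the extracted binary hierarchy with indentation."""
    label = class_names[node.id] if node.is_leaf() else f"node#{node.id}"
    print("  " * depth + label)
    if not node.is_leaf():
        print_tree(node.left, depth + 1)
        print_tree(node.right, depth + 1)

print_tree(root)
```

Each internal `node#` would then be named by matching its member classes against a concept bank, as the abstract describes.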
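One plausible reading of tree-traversal inference with uncertainty-aware early stopping is a top-down descent that returns a coarser internal label whenever confidence in the best child falls below a threshold. The sketch below is an assumption-laden toy, not the paper's UAES: the four-class hierarchy, one-hot prototypes, and threshold value are all illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy hierarchy with one-hot leaf prototypes and mean parent prototypes;
# all names and values are hypothetical, chosen only for the illustration.
leaves = {"cat": 0, "dog": 1, "car": 2, "truck": 3}
protos = {name: np.eye(4)[i] for name, i in leaves.items()}
protos["animal"] = (protos["cat"] + protos["dog"]) / 2
protos["vehicle"] = (protos["car"] + protos["truck"]) / 2
tree = {"root": ["animal", "vehicle"],
        "animal": ["cat", "dog"],
        "vehicle": ["car", "truck"]}

def classify(x, node="root", tau=0.6):
    """Traverse the hierarchy top-down; stop early and return the
    current (coarser) internal label when the best-child probability
    drops below the confidence threshold tau."""
    while node in tree:
        kids = tree[node]
        probs = softmax(np.array([x @ protos[k] for k in kids]))
        if probs.max() < tau:
            return node  # uncertain at this level: early stop
        node = kids[int(probs.argmax())]
    return node
```

A clean "cat" embedding descends to the leaf, while an input ambiguous between "cat" and "dog" stops at "animal", giving a correct but coarser, explainable answer.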