[2603.25573] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.25573 (cs)

[Submitted on 26 Mar 2026]

Title: Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference

Authors: Sk Miraj Ahmed, Xi Yu, Yunqi Li, Yuewei Lin, Wei Xu

Abstract: Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction: inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic...
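The abstract does not spell out the HiR objective, so the following is only a minimal sketch of what a regularizer that shapes embedding geometry across taxonomic levels could look like. It assumes one integer label per level (order, family, genus, species) and a similarity margin that tightens at deeper levels; the function name hierarchy_regularizer, the level weights, and the margin schedule are illustrative assumptions, not the paper's formulation.

# Hypothetical sketch of a hierarchy-aware embedding regularizer;
# NOT the paper's HiR definition. Pairs that share a taxon at a deeper
# level are pushed to be more similar than pairs that only share a
# shallow level.
import torch

def hierarchy_regularizer(z, labels, level_weights=(0.1, 0.2, 0.3, 0.4)):
    """z: (B, D) embeddings; labels: (B, 4) integer ids for
    (order, family, genus, species). Returns a scalar penalty."""
    z = torch.nn.functional.normalize(z, dim=-1)
    sim = z @ z.T                      # (B, B) cosine similarities
    loss = z.new_zeros(())
    for lvl, w in enumerate(level_weights):
        # mask of pairs sharing the same taxon at this level, self excluded
        same = labels[:, lvl].unsqueeze(0) == labels[:, lvl].unsqueeze(1)
        same = same & ~torch.eye(len(z), dtype=torch.bool, device=z.device)
        if same.any():
            # deeper levels demand a higher similarity floor (assumed schedule)
            margin = 0.5 + 0.1 * lvl
            loss = loss + w * torch.relu(margin - sim[same]).mean()
    return loss

z = torch.randn(8, 128)
labels = torch.randint(0, 3, (8, 4))
print(hierarchy_regularizer(z, labels))  # scalar penalty term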
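In the same spirit, here is a hedged sketch of how a lightweight fusion predictor could support image-only, DNA-only, or joint inference: an absent modality is replaced by a learned placeholder vector, and randomly dropping a modality during training (not shown) would mimic modality corruption. The class name FusionPredictor and its architecture are assumptions; the actual CLiBD-HiR-Fuse predictor may differ.

# Hypothetical sketch of a modality-tolerant fusion head, not the
# paper's implementation. Missing inputs are passed as None.
import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        # learned stand-ins substituted when a modality is missing
        self.img_missing = nn.Parameter(torch.zeros(dim))
        self.dna_missing = nn.Parameter(torch.zeros(dim))
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, img_emb=None, dna_emb=None):
        # at least one modality must be provided
        batch = (img_emb if img_emb is not None else dna_emb).shape[0]
        img = img_emb if img_emb is not None else self.img_missing.expand(batch, -1)
        dna = dna_emb if dna_emb is not None else self.dna_missing.expand(batch, -1)
        return self.head(torch.cat([img, dna], dim=-1))

# one head serves image-only, DNA-only, and joint inference
head = FusionPredictor(dim=128, num_classes=10)
img, dna = torch.randn(4, 128), torch.randn(4, 128)
print(head(img_emb=img).shape, head(dna_emb=dna).shape, head(img, dna).shape)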