[2601.01162] Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
Computer Science > Machine Learning
arXiv:2601.01162 (cs)
[Submitted on 3 Jan 2026 (v1), last revised 5 Apr 2026 (this version, v2)]

Title: Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
Authors: Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

Abstract: Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to descr...
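
To make the equidistance problem described in the abstract concrete, here is a minimal Python sketch. It is not the ARISE method. The attribute values (`nurse`, `surgeon`, `accountant`) and the two-dimensional `toy_embedding` vectors are hypothetical and hand-picked for illustration; they are not outputs of any LLM. The sketch only shows that one-hot encoding places every pair of distinct values at the same distance, whereas any representation carrying semantic information can separate related from unrelated values.

```python
import math
from itertools import combinations


def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over the vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical values of a single categorical attribute (e.g. "occupation").
values = ["nurse", "surgeon", "accountant"]

# Under one-hot encoding, every pair of distinct values is equally far apart.
one_hot_dist = {
    (u, v): euclidean(one_hot(u, values), one_hot(v, values))
    for u, v in combinations(values, 2)
}

# Assumption: hand-made 2-D vectors standing in for semantic embeddings.
# A real pipeline would obtain such vectors from an external model.
toy_embedding = {
    "nurse": [0.9, 0.1],
    "surgeon": [0.8, 0.2],
    "accountant": [0.1, 0.9],
}

# With semantic vectors, related values end up closer than unrelated ones.
semantic_dist = {
    (u, v): euclidean(toy_embedding[u], toy_embedding[v])
    for u, v in combinations(values, 2)
}

for pair in one_hot_dist:
    print(pair, round(one_hot_dist[pair], 3), round(semantic_dist[pair], 3))
```

In this toy setting all one-hot distances coincide, while the toy semantic distance between `nurse` and `surgeon` is smaller than between `nurse` and `accountant`. This is the kind of structure that an equidistant encoding cannot express.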