[2602.22698] Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Summary
This paper presents KGT, a novel framework addressing the granularity mismatch between large language models (LLMs) and knowledge graphs (KGs) by introducing dedicated entity tokens for improved knowledge graph completion.
Why It Matters
As LLMs become increasingly integrated into various AI applications, bridging the gap between their token-based processing and the entity-centric nature of knowledge graphs is crucial for enhancing knowledge graph completion tasks. This research offers a promising solution that could improve the performance of AI systems reliant on knowledge representation.
Key Takeaways
- KGT framework introduces dedicated entity tokens for better feature representation.
- The method fuses structural and textual features using a relation-guided gating mechanism.
- Decoupled prediction allows for independent semantic and structural reasoning.
- Experimental results show KGT outperforms existing state-of-the-art methods.
- This approach addresses the fundamental granularity mismatch between LLMs and knowledge graphs.
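The fusion step in the takeaways above can be sketched in a few lines. This is a minimal illustration of a relation-guided gating mechanism, not the paper's released implementation: all names, dimensions, and the single-layer gate are assumptions, and the paper's actual gate may be parameterized differently.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # illustrative embedding size, not from the paper

# Hypothetical pre-trained features for one entity.
structural = rng.normal(size=dim)  # e.g., from a KG embedding model
textual = rng.normal(size=dim)     # e.g., from a text encoder
relation = rng.normal(size=dim)    # embedding of the query relation

# Relation-guided gate: the relation decides, per dimension, how much
# structural vs. textual signal flows into the fused entity embedding.
W = rng.normal(size=(dim, dim)) * 0.1          # assumed learnable projection
gate = 1.0 / (1.0 + np.exp(-(W @ relation)))   # element-wise sigmoid, in [0, 1]

fused = gate * structural + (1.0 - gate) * textual
```

The convex combination keeps the fused embedding in the span of the two feature sources, so a relation that is well-served by graph structure can push the gate toward 1 while a text-heavy relation pushes it toward 0.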
Computer Science > Computation and Language
arXiv:2602.22698 (cs) [Submitted on 26 Feb 2026]
Title: Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Authors: Siyue Su, Jian Yang, Bo Li, Guanglin Niu
Abstract: Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding trai...
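The "dedicated entity tokens" and "full-space prediction" described in the abstract can be pictured as scoring a query hidden state against one embedding row per entity, rather than restricting predictions to a small candidate set. The sketch below is an assumption-laden illustration (table size, scoring by dot product, and softmax normalization are all my choices, not confirmed details of KGT):

```python
import numpy as np

rng = np.random.default_rng(1)
num_entities, dim = 5, 8  # toy sizes for illustration

# Dedicated entity-token embedding table: one row per KG entity, so each
# entity is a single unit rather than a fragmented sub-word sequence.
entity_table = rng.normal(size=(num_entities, dim))

# Hidden state produced at the prediction position (hypothetical).
hidden = rng.normal(size=dim)

# Full-space prediction: score every entity at once with a dot product,
# then normalize to a distribution over the whole entity vocabulary.
scores = entity_table @ hidden
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))
```

Because the table has exactly one row per entity, ranking all entities costs a single matrix-vector product, which is what makes prediction over the full entity space tractable.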