[2408.03099] Topic Modeling with Fine-tuning LLMs and Bag of Sentences
Summary
This paper presents FT-Topic, a novel approach for topic modeling that fine-tunes large language models (LLMs) using bags of sentences, outperforming traditional methods.
Why It Matters
As topic modeling is crucial for understanding large text datasets, this research enhances the capabilities of LLMs, making them more effective for unsupervised learning tasks. The proposed method could significantly improve the efficiency and accuracy of topic extraction in various applications, from academic research to business intelligence.
Key Takeaways
- FT-Topic improves topic modeling by fine-tuning LLMs with bags of sentences.
- The method automates dataset construction for training, enhancing efficiency.
- SenClu, the resulting model, achieves state-of-the-art performance in topic modeling.
- The approach allows for the integration of prior knowledge into topic-document distributions.
- Fast inference is achieved through an expectation-maximization algorithm.
Computer Science > Computation and Language arXiv:2408.03099 (cs) [Submitted on 6 Aug 2024 (v1), last revised 20 Feb 2026 (this version, v2)] Title:Topic Modeling with Fine-tuning LLMs and Bag of Sentences Authors:Johannes Schneider View a PDF of the paper titled Topic Modeling with Fine-tuning LLMs and Bag of Sentences, by Johannes Schneider View PDF Abstract:Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable labeled dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. Based on this idea, we derive an approach called FT-Topic to perform unsupervised fine-tuning, relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are assumed to belong either to the same topic or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach that uses embeddings. In this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu. The method achieves f...