[2408.03099] Topic Modeling with Fine-tuning LLMs and Bag of Sentences

arXiv - Machine Learning 4 min read Article

Summary

This paper presents FT-Topic, a novel approach to topic modeling that fine-tunes large language model (LLM) encoders on bags of sentences, outperforming both classical topic models and methods that use pre-trained encoders out of the box.

Why It Matters

As topic modeling is crucial for understanding large text datasets, this research enhances the capabilities of LLMs, making them more effective for unsupervised learning tasks. The proposed method could significantly improve the efficiency and accuracy of topic extraction in various applications, from academic research to business intelligence.

Key Takeaways

  • FT-Topic improves topic modeling by fine-tuning LLMs with bags of sentences.
  • The method automates dataset construction for training, enhancing efficiency.
  • SenClu, the resulting model, achieves state-of-the-art performance in topic modeling.
  • The approach allows for the integration of prior knowledge into topic-document distributions.
  • Fast inference is achieved through an expectation-maximization algorithm.
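The first two takeaways describe an automatic pair-construction and filtering pipeline. A minimal sketch of how such pair construction could look is below; this is a hypothetical illustration, not the paper's exact heuristic, and the function name `make_pairs`, the adjacency assumption, and all parameters are assumptions:

```python
# Hypothetical sketch of FT-Topic-style training-pair construction (assumption:
# the paper's actual heuristic and its mislabel-filtering step differ in detail).
# Adjacent sentence groups within a document are treated as same-topic positives;
# groups drawn from different documents are treated as different-topic negatives.
import random

def make_pairs(docs, group_size=3, n_neg=5, seed=0):
    """docs: list of documents, each a list of sentences."""
    rng = random.Random(seed)
    # Split each document into consecutive sentence groups, tagged with doc id.
    groups = [
        (d, doc[i:i + group_size])
        for d, doc in enumerate(docs)
        for i in range(0, len(doc) - group_size + 1, group_size)
    ]
    # Positive pairs: consecutive groups from the same document (label 1).
    pos = [
        (g1, g2, 1)
        for (d1, g1), (d2, g2) in zip(groups, groups[1:])
        if d1 == d2
    ]
    # Negative pairs: random groups from different documents (label 0).
    neg = []
    while len(neg) < n_neg:
        (d1, g1), (d2, g2) = rng.sample(groups, 2)
        if d1 != d2:
            neg.append((g1, g2, 0))
    return pos + neg
```

The resulting labeled pairs would then feed a contrastive fine-tuning objective for the encoder; the paper additionally removes pairs that are likely mislabeled, which this sketch omits.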

Computer Science > Computation and Language
arXiv:2408.03099 (cs)
[Submitted on 6 Aug 2024 (v1), last revised 20 Feb 2026 (this version, v2)]

Title: Topic Modeling with Fine-tuning LLMs and Bag of Sentences
Authors: Johannes Schneider

Abstract: Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out of the box, despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable labeled dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. Based on this idea, we derive an approach called FT-Topic to perform unsupervised fine-tuning, relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are assumed to belong either to the same topic or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach that uses embeddings. In this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu. The method achieves f...
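The fast inference mentioned above can be illustrated with a generic hard-assignment EM over sentence-group embeddings. This is a minimal sketch under stated assumptions: SenClu's actual E/M steps, its topic-document priors, and the function `hard_em_topics` are not from the paper.

```python
# Generic hard-assignment EM over embeddings (illustrative only; assumption:
# SenClu's actual algorithm and its prior-knowledge mechanism differ in detail).
# E-step: assign each embedding to its most similar topic vector.
# M-step: recompute each topic vector as the mean of its assigned embeddings.
import numpy as np

def hard_em_topics(embeddings, k, n_iter=20):
    # Normalize rows so a dot product equals cosine similarity.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centers = X[:k].copy()  # simple deterministic initialization
    for _ in range(n_iter):
        assign = (X @ centers.T).argmax(axis=1)  # E-step: hard assignment
        for j in range(k):                       # M-step: update topic vectors
            members = X[assign == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return assign, centers
```

Hard assignment avoids the per-topic responsibility weights of soft EM, which is one common way such methods keep inference fast.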

Related Articles

Anthropic essentially bans OpenClaw from Claude by making subscribers pay extra | The Verge

The popular combination of OpenClaw and Claude Code is being severed now that Anthropic has announced it will start charging subscribers ...

The Verge - AI · 4 min ·
LLMs

wtf bro did what? arc 3 2026

The Physarum Explorer is a high-speed, bio-inspired neural model designed specifically for ARC geometry. Here is the snapshot of its curr...

Reddit - Artificial Intelligence · 1 min ·
LLMs

A robot car with a Claude AI brain started a YouTube vlog about its own existence

Not a demo reel. Not a tutorial. A robot narrating its own experience — debugging, falling off shelves, questioning its identity. First-p...

Reddit - Artificial Intelligence · 1 min ·
LLMs

Study: LLMs Able to De-Anonymize User Accounts on Reddit, Hacker News & Other "Pseudonymous" Platforms; Report Co-Author Expands, Advises

Advice from the study's co-author: "Be aware that it’s not any single post that identifies you, but the combination of small details acro...

Reddit - Artificial Intelligence · 1 min ·