[2602.23234] Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Summary
This paper improves app store search ranking by augmenting abundant behavioral relevance signals with LLM-generated textual relevance labels, reporting gains in both search relevance and conversion rates.
Why It Matters
As app stores become increasingly competitive, search relevance directly affects user satisfaction and conversion rates. This research shows that LLMs can offset the scarcity of expert-provided textual relevance labels, giving a scalable way to train rankers on textual relevance alongside behavioral signals.
Key Takeaways
- Combining behavioral and textual relevance improves app store ranking.
- A specialized, fine-tuned LLM outperforms a much larger pre-trained model at generating textual relevance labels.
- Augmenting rankers with LLM-generated labels leads to measurable increases in conversion rates.
Computer Science > Information Retrieval
arXiv:2602.23234 (cs)
[Submitted on 26 Feb 2026]
Title: Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Authors: Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad
Abstract: Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textua...
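The abstract reports offline NDCG measured against two label sets, behavioral and textual. As a rough illustration of what that measurement involves, the sketch below scores one ranking against each label set using the common exponential-gain form of NDCG. It is not the authors' evaluation code: the app identifiers, the graded label values, and the function names `dcg` and `ndcg` are all hypothetical, and the paper may use a different gain function or cutoff.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with exponential gain (2^g - 1)."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_ids, labels, k=None):
    """NDCG@k of a ranking against a dict of graded labels (missing items count as 0)."""
    k = len(ranked_ids) if k is None else k
    gains = [labels.get(item, 0) for item in ranked_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking for one query, scored against two hypothetical label sets.
ranking = ["app_a", "app_b", "app_c"]
behavioral_labels = {"app_a": 2, "app_b": 0, "app_c": 1}  # e.g. derived from clicks/downloads
textual_labels = {"app_a": 1, "app_b": 2, "app_c": 0}     # e.g. LLM-generated judgments

behavioral_ndcg = ndcg(ranking, behavioral_labels)
textual_ndcg = ndcg(ranking, textual_labels)
print(f"behavioral NDCG: {behavioral_ndcg:.3f}, textual NDCG: {textual_ndcg:.3f}")
```

Plotting the two scores for different ranker variants, one axis per objective, is one way to picture the Pareto frontier the abstract refers to: a variant that raises both scores at once moves the frontier outward.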