[2602.23234] Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
arXiv - Machine Learning 4 min read Article

Summary

This article presents an approach to improving app store search ranking by combining LLM-generated textual relevance labels with behavioral relevance signals, yielding gains in both offline ranking quality and conversion rates.

Why It Matters

As app stores become increasingly competitive, optimizing search relevance is crucial for user satisfaction and conversion rates. This research demonstrates how leveraging LLMs can bridge the gap in textual relevance labeling, providing a scalable solution to enhance search algorithms and improve user engagement.

Key Takeaways

  • Combining behavioral and textual relevance improves app store ranking.
  • Fine-tuned LLMs outperform larger pre-trained models in generating relevant labels.
  • Augmenting rankers with LLM-generated labels leads to measurable increases in conversion rates.

Computer Science > Information Retrieval
arXiv:2602.23234 (cs) · Submitted on 26 Feb 2026

Title: Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Authors: Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad

Abstract: Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual…
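The abstract evaluates rankers with offline NDCG under two label types. As a minimal illustrative sketch (not the paper's method), the snippet below shows standard NDCG and a hypothetical blended label that mixes behavioral and textual relevance with an assumed weight `alpha`; the weighting scheme and example values are assumptions for illustration only.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance at earlier ranks counts more,
    # discounted by log2 of the (1-indexed) position + 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def blended_label(behavioral, textual, alpha=0.5):
    # Hypothetical convex mix of the two relevance objectives;
    # the paper does not specify this formula.
    return alpha * behavioral + (1 - alpha) * textual

# Example: three ranked results with (behavioral, textual) judgments.
results = [(1.0, 0.2), (0.5, 1.0), (0.0, 0.8)]
labels = [blended_label(b, t) for b, t in results]
print(round(ndcg(labels), 3))
```

A perfectly ordered list scores NDCG = 1.0; sweeping `alpha` traces how far a ranking trades one objective against the other, which is one way to think about the Pareto frontier the abstract describes.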

