[2601.04646] Succeeding at Scale: Automated Dataset Construction and Query-Side Adaptation for Multi-Tenant Search

arXiv - AI March 05, 2026 4 min read

Computer Science > Information Retrieval

arXiv:2601.04646 (cs)

[Submitted on 8 Jan 2026 (v1), last revised 3 Mar 2026 (this version, v3)]

Title: Succeeding at Scale: Automated Dataset Construction and Query-Side Adaptation for Multi-Tenant Search

Authors: Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

Abstract: Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data". This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices.

We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling.

We further propose an Index-Preserving Adaptation strategy that fine-tunes only the query encoder, achieving strong performance gains while keeping document indices fixed. Experiments on DevRev-Search, S...
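The abstract says candidate generation fuses results from sparse and dense retrievers but does not name the fusion method. As a minimal sketch, assuming reciprocal rank fusion (RRF) stands in for that step, the idea looks roughly like this. The function name, the constant k=60, and the passage ids are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one ranked list.

    Assumption: RRF is used here only as a common, simple fusion rule;
    the paper's actual fusion method is not stated in the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; each list contributes 1 / (k + rank) per passage.
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical retriever outputs for a single query.
sparse_ranking = ["p3", "p1", "p7"]  # e.g. a lexical (BM25-style) retriever
dense_ranking = ["p1", "p9", "p3"]   # e.g. an embedding-based retriever

candidates = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(candidates)
```

Passages that rank well in both lists rise to the top of the fused list, which is the property a candidate pool needs before a judge model filters and labels it.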

Originally published on March 05, 2026. Curated by AI News.
