[2603.04743] DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
About this article
Abstract page for arXiv paper 2603.04743: DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
Computer Science > Information Retrieval arXiv:2603.04743 (cs) [Submitted on 5 Mar 2026] Title:DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval Authors:Maojun Sun, Yue Wu, Yifei Xie, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang View a PDF of the paper titled DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval, by Maojun Sun and 7 other authors View PDF HTML (experimental) Abstract:Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE ach...