[2601.13227] Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?
Abstract page for arXiv paper 2601.13227: Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?
Abstract page for arXiv paper 2602.00095: EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM...
Abstract page for arXiv paper 2601.13222: Incorporating Q&A Nuggets into Retrieval-Augmented Generation
Abstract page for arXiv paper 2602.10541: FastLSQ: A Framework for One-Shot PDE Solving
Abstract page for arXiv paper 2511.09396: Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque
Abstract page for arXiv paper 2510.26840: SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
Abstract page for arXiv paper 2509.25106: Towards Personalized Deep Research: Benchmarks and Evaluations
Abstract page for arXiv paper 2602.05286: HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reli...
Abstract page for arXiv paper 2412.13091: LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
Abstract page for arXiv paper 2509.22580: The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?
Abstract page for arXiv paper 2508.06066: Effective Sample Size and Generalization Bounds for Temporal Networks
Abstract page for arXiv paper 2602.09937: Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
Abstract page for arXiv paper 2601.16529: SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters fo...
Abstract page for arXiv paper 2509.21782: Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety
Abstract page for arXiv paper 2505.13033: TSPulse: Tiny Pre-Trained Models with Disentangled Representations for Rapid Time-Series Analysis
Abstract page for arXiv paper 2502.01534: Preference Leakage: A Contamination Problem in LLM-as-a-judge
Abstract page for arXiv paper 2412.06531: Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
Abstract page for arXiv paper 2412.01654: FSMLP: Modelling Channel Dependencies With Simplex Theory Based Multi-Layer Perceptions In Freq...
Abstract page for arXiv paper 2603.04356: RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
Abstract page for arXiv paper 2603.04334: SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
Abstract page for arXiv paper 2603.04325: Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images
Abstract page for arXiv paper 2603.04198: Stable and Steerable Sparse Autoencoders with Weight Regularization
Abstract page for arXiv paper 2603.04162: Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Lan...