[2603.01710] Legal RAG Bench: an end-to-end benchmark for legal RAG
Computer Science > Computation and Language

arXiv:2603.01710 (cs) [Submitted on 2 Mar 2026]

Title: Legal RAG Bench: an end-to-end benchmark for legal RAG
Authors: Abdur-Rahman Butler, Umar Butler

Abstract: We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and a novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval a...
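The full factorial design mentioned in the abstract pairs every embedding model with every LLM, so that each factor's marginal effect on correctness and groundedness can be isolated. The following is a minimal sketch of such an evaluation loop, assuming hypothetical helper functions (run_rag_pipeline, score) and placeholder scoring; none of these names come from the paper.

```python
import itertools
from dataclasses import dataclass

# Model names taken from the abstract; everything else here is an assumption.
EMBEDDERS = ["kanon-2-embedder", "gemini-embedding-001", "text-embedding-3-large"]
LLMS = ["gemini-3.1-pro", "gpt-5.2"]


@dataclass
class CellResult:
    embedder: str
    llm: str
    correctness: float    # mean score over the benchmark's 100 questions
    groundedness: float


def run_rag_pipeline(embedder: str, llm: str, question: str) -> dict:
    """Hypothetical stand-in: retrieve passages with `embedder`, answer with `llm`.

    A real implementation would embed the 4,876 charge-book passages, retrieve
    the top-k for the question, and prompt the LLM with the retrieved context.
    """
    return {"answer": "", "passages": []}


def score(prediction: dict, reference: dict) -> tuple[float, float]:
    """Hypothetical correctness/groundedness judge (e.g. an LLM grader)."""
    return 0.0, 0.0


def evaluate(questions: list[dict]) -> list[CellResult]:
    results = []
    # Full factorial design: every embedder is crossed with every LLM,
    # so retrieval and reasoning contributions can be separated afterwards.
    for embedder, llm in itertools.product(EMBEDDERS, LLMS):
        correctness_scores, groundedness_scores = [], []
        for q in questions:
            pred = run_rag_pipeline(embedder, llm, q["question"])
            c, g = score(pred, q)
            correctness_scores.append(c)
            groundedness_scores.append(g)
        results.append(CellResult(
            embedder=embedder,
            llm=llm,
            correctness=sum(correctness_scores) / len(correctness_scores),
            groundedness=sum(groundedness_scores) / len(groundedness_scores),
        ))
    return results
```

Averaging each factor's cells over the levels of the other factor (e.g. Kanon 2 Embedder across both LLMs) is what supports the kind of per-model marginal comparisons reported in the abstract.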