[2602.22223] SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Summary
The paper introduces SQaLe, a large-scale, semi-synthetic text-to-SQL dataset for developing models that convert natural language queries into SQL. It targets a gap in existing datasets: too little scale, schema and query complexity, domain coverage, and task diversity.
Why It Matters
Generalizable text-to-SQL models need training data that resembles real databases. SQaLe's schemas are expanded from real-world ones and its queries vary in pattern and complexity, which may improve model training and generalization for applications that query databases through natural language.
Key Takeaways
- SQaLe consists of 517,676 high-quality (question, schema, query) triples.
- The dataset is built on 135,875 relational database schemas expanded from SchemaPile, a collection of real-world schemas.
- SQaLe addresses the lack of large-scale datasets in text-to-SQL research.
- It captures realistic schema size variability and natural language ambiguity.
- The dataset is accessible for further research and development in text-to-SQL models.
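A triple of this shape can be sanity-checked by building an empty database from the schema and running the query against it. The sketch below is a minimal illustration and not the authors' pipeline: it assumes the schema is given as `CREATE TABLE` statements in the SQLite dialect, and the function name `executes_against_schema`, the dictionary keys, and the example triple are all hypothetical.

```python
import sqlite3


def executes_against_schema(schema_ddl: str, query: str) -> bool:
    """Return True if `query` runs without error on an empty
    in-memory SQLite database created from `schema_ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # create the (empty) tables
        conn.execute(query).fetchall()  # empty tables are enough to catch bad names or syntax
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical (question, schema, query) triple; not taken from SQaLe.
triple = {
    "question": "How many orders has each customer placed?",
    "schema": """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            total REAL
        );
    """,
    "query": """
        SELECT c.name, COUNT(o.id) AS order_count
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
    """,
}

is_valid = executes_against_schema(triple["schema"], triple["query"])
```

A check like this only confirms that the query parses and references existing tables and columns; it says nothing about whether the query actually answers the question.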
Computer Science > Information Retrieval
arXiv:2602.22223 (cs)
[Submitted on 16 Dec 2025]
Title: SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Authors: Cornelius Wolff, Daniel Gomm, Madelon Hulsebos
Abstract: Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data ...