
[2602.22223] SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas

arXiv - Machine Learning February 27, 2026 3 min read Article

Summary

The paper introduces SQaLe, a large-scale, semi-synthetic text-to-SQL dataset for developing models that translate natural language questions into SQL. It targets the limited scale, schema and query complexity, and domain coverage of existing datasets.

Why It Matters

Progress on generalizable text-to-SQL models has been held back by the lack of large datasets with realistic schemas and queries. SQaLe gives researchers and developers 517,676 (question, schema, query) triples over 135,875 schemas. Its realistic schema sizes and query complexity are intended to improve model training and generalization for applications that query databases in natural language.

Key Takeaways

SQaLe consists of 517,676 high-quality (question, schema, query) triples.
The dataset is built on 135,875 relational database schemas expanded from SchemaPile, a collection of real-world schemas.
SQaLe addresses the lack of large-scale datasets in text-to-SQL research.
It captures realistic schema size variability and natural language ambiguity.
The dataset is accessible for further research and development in text-to-SQL models.

Computer Science > Information Retrieval

arXiv:2602.22223 (cs), submitted on 16 Dec 2025

Title: SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas

Authors: Cornelius Wolff, Daniel Gomm, Madelon Hulsebos

Abstract: Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data ...
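To make the idea of a (question, schema, query) triple and of "execution validity" concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `Triple` class, the `is_executable` helper, and the example schema, question, and query are all hypothetical, and using SQLite as the execution engine is an assumption made purely for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical container for one (question, schema, query) example."""
    question: str  # natural language question
    schema: str    # DDL statements that define the database
    query: str     # SQL query that should answer the question


def is_executable(triple: Triple) -> bool:
    """Return True if the query runs against an empty database built from the schema.

    This only checks that the SQL parses and references existing tables and
    columns; it does not check that the query answers the question.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(triple.schema)
        conn.execute(triple.query).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Invented example, not taken from SQaLe.
example = Triple(
    question="How many orders did each customer place?",
    schema="""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id)
        );
    """,
    query="""
        SELECT c.name, COUNT(o.id) AS order_count
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
    """,
)

# Same schema, but the query references a misspelled column.
broken = Triple(example.question, example.schema, "SELECT nme FROM customers")

print(is_executable(example))
print(is_executable(broken))
```

A check of this kind filters out queries that reference missing tables or columns, which is one plausible reading of "execution validity"; the paper itself should be consulted for how the authors validate their queries.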
