[2602.22223] SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas


Summary

The paper introduces SQaLe, a large-scale text-to-SQL dataset designed to improve the training of models that convert natural language queries into SQL, addressing the scale, schema realism, and query diversity limitations of existing datasets.

Why It Matters

As the demand for effective natural language processing tools grows, SQaLe provides a crucial resource for researchers and developers in the text-to-SQL domain. Its realistic schema and query complexity can significantly improve model training and generalization, fostering advancements in AI applications that rely on database interactions.

Key Takeaways

  • SQaLe consists of 517,676 high-quality (question, schema, query) triples.
  • The dataset is built on 135,875 relational database schemas, enhancing diversity and complexity.
  • SQaLe addresses the lack of large-scale datasets in text-to-SQL research.
  • It captures realistic schema size variability and natural language ambiguity.
  • The dataset is accessible for further research and development in text-to-SQL models.

Computer Science > Information Retrieval, arXiv:2602.22223 (cs). Submitted on 16 Dec 2025.

Title: SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Authors: Cornelius Wolff, Daniel Gomm, Madelon Hulsebos

Abstract: Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data ...
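The abstract describes (question, schema, query) triples that maintain execution validity. A minimal sketch of what such a triple and a validity check might look like is below; the field names and the SQLite-based check are illustrative assumptions for this article, not the authors' actual pipeline or data format.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    question: str  # natural-language question
    schema: str    # DDL: CREATE TABLE statements
    query: str     # target SQL query

def executes_validly(t: Triple) -> bool:
    """Check execution validity by building the schema in an empty
    in-memory SQLite database and attempting to run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(t.schema)
        conn.execute(t.query)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical example triple in the spirit of the dataset.
example = Triple(
    question="How many employees work in each department?",
    schema="CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT);",
    query="SELECT dept, COUNT(*) FROM employees GROUP BY dept;",
)
print(executes_validly(example))  # True
```

A check like this only confirms the query parses and runs against the schema; it does not verify that the query answers the question, which is why the paper's pipeline pairs it with question synthesis and SQL construction stages.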


