[2504.17203] High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services
Summary
This paper presents a method for generating high-fidelity test data for SQL code generation services, addressing limitations of traditional data generation techniques in handling complex data structures.
Why It Matters
The ability to generate high-fidelity test data is crucial for testing SQL code generation services, especially when production data is unavailable. This research leverages Large Language Models to create semantically and syntactically correct mock data, enhancing testing efficiency and coverage.
Key Takeaways
- Traditional data generation methods struggle with complex data structures, such as columns with nested structures, and with the semantic relationships between them.
- Leveraging Large Language Models can produce high-fidelity test data.
- The proposed method ensures semantic integrity and structural compliance.
- Semantically coherent test data improves test coverage for SQL code generation services.
- The approach addresses the challenges of limited access to production datasets.
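The "structural compliance" takeaway can be illustrated with a small check that a mock row matches a schema containing nested columns. This is a minimal sketch under stated assumptions, not the paper's pipeline: the schema notation, the `conforms` function, and the `ORDER_SCHEMA` example are all hypothetical and were chosen only for illustration.

```python
from typing import Any

# Hypothetical schema notation (an assumption for this sketch):
#   str  -> scalar type name
#   dict -> STRUCT with named fields
#   list -> ARRAY whose single element is the element schema
SCALARS = {"INT64": int, "STRING": str, "FLOAT64": float, "BOOL": bool}


def conforms(value: Any, schema: Any) -> bool:
    """Return True if a mock value structurally matches the schema."""
    if isinstance(schema, str):
        expected = SCALARS[schema]
        # bool is a subclass of int in Python, so reject it for INT64.
        if expected is int and isinstance(value, bool):
            return False
        return isinstance(value, expected)
    if isinstance(schema, list):
        return isinstance(value, list) and all(
            conforms(item, schema[0]) for item in value
        )
    if isinstance(schema, dict):
        return (
            isinstance(value, dict)
            and value.keys() == schema.keys()
            and all(conforms(value[name], sub) for name, sub in schema.items())
        )
    raise TypeError(f"unsupported schema node: {schema!r}")


# Hypothetical table with a STRUCT column and an ARRAY<STRUCT> column.
ORDER_SCHEMA = {
    "order_id": "INT64",
    "customer": {"name": "STRING", "vip": "BOOL"},
    "items": [{"sku": "STRING", "qty": "INT64"}],
}

good_row = {
    "order_id": 1,
    "customer": {"name": "Ada", "vip": False},
    "items": [{"sku": "A-1", "qty": 2}],
}

# Same row, but qty is a string inside the nested array.
bad_row = {
    "order_id": 1,
    "customer": {"name": "Ada", "vip": False},
    "items": [{"sku": "A-1", "qty": "2"}],
}
```

A check like this covers structure only. Semantic integrity, for example whether a generated `sku` is plausible for the order it appears in, would need separate validation.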
Computer Science > Databases
arXiv:2504.17203 (cs)
[Submitted on 24 Apr 2025 (v1), last revised 25 Feb 2026 (this version, v4)]
Title: High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services
Authors: Shivasankari Kannan, Yeounoh Chung, Amita Gondi, Tristan Swadell, Fatma Ozcan
Abstract: The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that include columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post...