[2602.22549] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
Summary
DrivePTS introduces a progressive learning framework for generating diverse driving scenes, improving the fidelity and controllability of scene generation for autonomous driving systems.
Why It Matters
As autonomous driving systems advance, generating high-quality driving scenes is crucial for validating their robustness. DrivePTS addresses the limitations of existing methods by enhancing both the semantic and structural fidelity of generated scenes, which is essential for real-world data augmentation and validation.
Key Takeaways
- DrivePTS employs a progressive learning strategy to reduce inter-dependency among geometric conditions.
- Utilizes a Vision-Language Model for detailed multi-view hierarchical scene descriptions.
- Introduces frequency-guided structure loss to enhance foreground detail and visual fidelity.
- Achieves state-of-the-art results in generating diverse driving scenes, including rare scenarios.
- Demonstrates strong generalization capabilities compared to prior methods.
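The frequency-guided structure loss in the takeaways can be read as a spatially weighted denoising objective: instead of the uniform spatial weighting the abstract criticizes, pixels in high-frequency (edge-rich, typically foreground) regions contribute more to the loss. The sketch below is an illustrative interpretation, not the paper's exact formulation; the `highpass_weights` helper, the `alpha` parameter, and the Laplacian high-pass filter are assumptions of this sketch.

```python
import numpy as np

def highpass_weights(target, alpha=1.0):
    # Approximate "frequency guidance" with a discrete 4-neighbour Laplacian
    # high-pass of the target image (wrap-around borders via np.roll, chosen
    # only for brevity); edge-rich regions get larger weights.
    lap = np.abs(
        4 * target
        - np.roll(target, 1, axis=0) - np.roll(target, -1, axis=0)
        - np.roll(target, 1, axis=1) - np.roll(target, -1, axis=1)
    )
    # Normalize so weights lie in [1, 1 + alpha]: flat background keeps
    # weight 1, the strongest edges get the maximum boost.
    return 1.0 + alpha * lap / (lap.max() + 1e-8)

def structure_weighted_mse(pred_noise, true_noise, target):
    # Standard denoising MSE, but with per-pixel weights instead of the
    # uniform weighting that, per the abstract, blurs foreground detail.
    w = highpass_weights(target)
    return float(np.mean(w * (pred_noise - true_noise) ** 2))
```

Because every weight is at least 1, the weighted loss upper-bounds the uniform MSE, and prediction errors on high-frequency foreground pixels are penalized more heavily than errors on flat background.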
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.22549 (cs) [Submitted on 26 Feb 2026]
Authors: Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Ziyao Lin, Cheng Lu
Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to miti...