[2505.22914] cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning
Summary
The paper presents 'cadrille', a multi-modal CAD reconstruction model utilizing reinforcement learning to process diverse input data, achieving state-of-the-art results in CAD tasks.
Why It Matters
This research addresses the limitations of existing CAD reconstruction methods that rely on single input modalities. By integrating multiple data types, it enhances the robustness and accessibility of CAD applications, potentially transforming engineering and manufacturing processes.
Key Takeaways
- cadrille processes point clouds, images, and text simultaneously for CAD reconstruction.
- The model employs a two-stage training approach: supervised fine-tuning followed by reinforcement learning.
- cadrille sets new benchmarks in CAD tasks, outperforming single-modal methods.
- The research is the first to apply Group Relative Policy Optimization (GRPO) to RL fine-tuning for CAD reconstruction.
- Code for the model is publicly available, promoting further research and development.
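The group-relative idea behind GRPO in the takeaways above can be sketched in a few lines: several candidate CAD programs are sampled for the same input, each is scored by a programmatic reward, and each reward is normalized against the group's mean and standard deviation to form an advantage. The group size and reward values below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sample's reward by the
    mean and standard deviation of its group (the core of GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: four CAD programs sampled for one prompt, scored by a
# programmatic checker with rewards in [0, 1] (hypothetical values).
adv = grpo_advantages([0.9, 0.2, 0.6, 0.3])
# Programs above the group mean get a positive advantage and are
# reinforced; those below the mean are penalized.
```

Normalizing within the group removes the need for a separate learned value model: each sample is judged only relative to its siblings from the same prompt.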
Computer Science > Computer Vision and Pattern Recognition
arXiv:2505.22914 (cs)
[Submitted on 28 May 2025 (v1), last revised 17 Feb 2026 (this version, v3)]
Title: cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning
Authors: Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna Vorontsova, Anton Konushin, Vladislav Kurenkov, Danila Rukhovich
Abstract: Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. Furthermore, we are the first to explore RL fin...
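The abstract's "online feedback, obtained programmatically" can be pictured as executing each generated CAD program in a sandbox and scoring the resulting geometry against the target shape. The sketch below is an illustrative assumption, not the paper's implementation: it treats a generated program as Python that defines a `points` array, and scores it with a Chamfer-distance-based reward, returning zero for unexecutable programs.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def programmatic_reward(cad_code, target_pts):
    """Execute a generated CAD program and score its sampled surface
    points against the target geometry; invalid code gets zero reward."""
    try:
        scope = {}
        exec(cad_code, {"np": np}, scope)  # program must define `points`
        pts = np.asarray(scope["points"], dtype=float)
    except Exception:
        return 0.0  # unexecutable or malformed program
    return 1.0 / (1.0 + chamfer(pts, target_pts))

# A valid program that reproduces the target scores higher than a broken one.
target = np.random.default_rng(0).random((64, 3))
good = "points = np.random.default_rng(0).random((64, 3))"
bad = "points = undefined_name"
assert programmatic_reward(good, target) > programmatic_reward(bad, target)
```

Because the checker runs online during RL fine-tuning, invalid or geometrically poor programs receive low reward immediately, without any human labeling in the loop.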