[2601.09605] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
Summary
The paper presents MANGO, an image translation method that enhances viewpoint robustness in robot manipulation policies trained on fixed-camera datasets, outperforming existing image translation approaches.
Why It Matters
As robotics increasingly relies on vision-based policies, addressing the challenges posed by varying camera viewpoints is crucial. MANGO's approach allows for effective training with limited real-world data, improving the adaptability of robotic systems in diverse environments. This research contributes to advancing robotics and AI by enabling more reliable and versatile manipulation capabilities.
Key Takeaways
- MANGO employs a segmentation-conditioned InfoNCE loss to enhance image translation.
- The method significantly improves success rates in real-world manipulation tasks by over 40 percentage points.
- MANGO requires only a small amount of real-world data to generate diverse viewpoints.
- The approach addresses the sim2real challenge, bridging the visual gap between simulated and real-world observations.
- This research highlights the importance of viewpoint consistency in training robust robotic policies.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2601.09605 (cs)
[Submitted on 14 Jan 2026 (v1), last revised 13 Feb 2026 (this version, v3)]
Title: Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
Authors: Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira
Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image...
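The paper's exact segmentation-conditioned formulation is not reproduced in this summary. As a rough illustration of the building block MANGO extends, here is a generic single-query InfoNCE loss over patch features; the function name `info_nce`, the feature dimension, and the temperature value are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss for one query patch.

    query:     (d,) feature of a translated-image patch
    positive:  (d,) feature of the corresponding source patch
    negatives: (n, d) features of non-corresponding patches
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Positive similarity at index 0, negatives after it.
    logits = np.array([cos(query, positive)] +
                      [cos(query, neg) for neg in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    log_prob = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob  # cross-entropy targeting the positive

rng = np.random.default_rng(0)
q = rng.normal(size=8)
# A matching positive yields a lower loss than an opposed one.
loss_easy = info_nce(q, q, rng.normal(size=(4, 8)))
loss_hard = info_nce(q, -q, rng.normal(size=(4, 8)))
```

In a segmentation-conditioned variant, the positive/negative split would be chosen using segmentation labels (patches from the same object class as positives), which is the direction the abstract describes without giving details here.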