[2510.05684] D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Computer Science > Artificial Intelligence
arXiv:2510.05684 (cs)

[Submitted on 7 Oct 2025 (v1), last revised 3 Mar 2026 (this version, v3)]

Title: D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Authors: Suhwan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee

Abstract: Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive cost of collecting physical trajectories. Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework demonstrating that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept its data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression, (2) t...
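The abstract names the OWA Toolkit's standardized format but does not specify its schema. Purely as an illustrative sketch of what a unified, timestamped desktop observation-action record might look like, the following Python snippet is offered; every name and field in it (DesktopEvent, EventKind, the payload layout) is a hypothetical assumption, not taken from the paper.

    # Hypothetical sketch of a unified desktop interaction record.
    # The OWA Toolkit's actual schema is not given in the abstract;
    # all names and fields below are illustrative assumptions.
    from dataclasses import dataclass, asdict
    from enum import Enum
    import json


    class EventKind(str, Enum):
        KEYBOARD = "keyboard"   # key press/release events
        MOUSE = "mouse"         # cursor moves, clicks, scrolls
        SCREEN = "screen"       # reference to an encoded video frame


    @dataclass
    class DesktopEvent:
        """One timestamped observation or action in a desktop session."""
        t_ns: int               # event timestamp in nanoseconds
        kind: EventKind         # which device/stream produced the event
        payload: dict           # kind-specific fields, e.g. {"key": "W", "down": True}

        def to_json(self) -> str:
            # Compact separators keep the serialized log small; real tooling
            # would likely pair a binary container with video codecs to reach
            # compression ratios like the 152x reported in the abstract.
            return json.dumps(asdict(self), separators=(",", ":"))


    if __name__ == "__main__":
        events = [
            DesktopEvent(0, EventKind.SCREEN, {"frame_idx": 0}),
            DesktopEvent(16_700_000, EventKind.MOUSE, {"dx": 4, "dy": -2}),
            DesktopEvent(21_000_000, EventKind.KEYBOARD, {"key": "W", "down": True}),
        ]
        for e in events:
            print(e.to_json())

The design intuition such a record captures is the one the abstract emphasizes: keyboard, mouse, and screen streams share a single timeline, preserving the observation-action coupling needed for embodied learning.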