[2505.12707] PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

arXiv - Machine Learning · 4 min read

Summary

PLAICraft introduces a large-scale dataset capturing time-aligned vision, speech, and action data from multiplayer Minecraft, aimed at advancing embodied AI research.

Why It Matters

This dataset addresses a critical gap in the availability of multi-modal, real-time interaction data for training and evaluating embodied AI agents. By providing over 10,000 hours of gameplay across five time-aligned modalities, it enables researchers to develop agents that perceive, speak, and act in a complex, socially interactive environment.

Key Takeaways

  • PLAICraft offers a novel dataset with over 10,000 hours of gameplay data.
  • The dataset includes five time-aligned modalities for comprehensive analysis.
  • It facilitates the study of synchronous embodied behavior in natural environments.
  • An evaluation suite is provided for benchmarking AI capabilities.
  • This work paves the way for advancements in real-time, embodied AI systems.

Computer Science > Machine Learning
arXiv:2505.12707 (cs)
[Submitted on 19 May 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI
Authors: Yingchen He, Christian D. Weilbach, Martyna E. Wojciechowska, Yuxuan Zhang, Frank Wood

Abstract: Advances in deep generative modeling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding...
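The abstract describes five per-modality streams logged with millisecond timestamps. As a minimal illustrative sketch of how such streams could be merged onto a single time-aligned timeline — the `Event` structure, modality names, and payloads below are assumptions for illustration, not the dataset's actual on-disk format:

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t_ms: int        # millisecond timestamp (per the paper's stated precision)
    modality: str    # e.g. "video", "game_audio", "mic_audio", "mouse", "keyboard"
    payload: object  # frame reference, audio chunk, or input action


def align_streams(*streams):
    """Merge per-modality streams (each sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.t_ms))


# Hypothetical toy streams, each already sorted by timestamp.
video = [Event(0, "video", "frame_0"), Event(33, "video", "frame_1")]
mouse = [Event(5, "mouse", (12, -3)), Event(40, "mouse", (0, 1))]
keys = [Event(10, "keyboard", "W down")]

timeline = align_streams(video, mouse, keys)
print([e.t_ms for e in timeline])  # → [0, 5, 10, 33, 40]
```

Because `heapq.merge` only assumes each input stream is individually sorted, this scales to long recordings without loading every modality into memory at once.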

