[2602.16585] DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows
Summary
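DataJoint 2.0 introduces the relational workflow model, in which tables represent workflow steps, rows represent artifacts, and foreign keys prescribe execution order. The goal is to give human-agent collaboration on scientific data pipelines enforceable data integrity and operational rigor.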
Why It Matters
This paper addresses the need for a unified framework for scientific workflows, where common approaches fragment provenance across disconnected systems without transactional guarantees. By unifying data structure, data, and computational transformations in a single formal system, DataJoint 2.0 aims to make human-agent collaboration in scientific research more reliable, which makes it relevant to researchers and developers who build and maintain data pipelines.
Key Takeaways
- DataJoint 2.0 offers a relational workflow model to unify data structure and computational transformations.
- The framework targets operational rigor in scientific data pipelines, the equivalent of DevOps that the authors call SciOps.
- Four technical innovations extend the model: object-augmented schemas, semantic matching, an extensible type system for domain-specific formats, and distributed job coordination.
- Semantic matching uses attribute lineage to prevent erroneous joins, making scientific data pipelines more reliable.
- The system is designed for scalability and composability, supporting collaboration between humans and agents.
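The semantic matching takeaway can be illustrated with a small sketch. The summary says only that attribute lineage is used to prevent erroneous joins; the mechanism below is an assumption made for illustration, not DataJoint's API. It supposes that each attribute records where it was first defined, and that a join is refused when two attributes share a name but not a lineage. The `Attribute` class, the `join_attributes` function, and all table and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    # Hypothetical lineage tag: "Table.attribute" where the attribute was first defined.
    lineage: str


def join_attributes(left, right):
    """Return the names that are safe to join on.

    Raises ValueError when both sides share an attribute name whose
    lineage differs, instead of silently matching on the name alone.
    """
    right_by_name = {attr.name: attr for attr in right}
    matched = []
    for attr in left:
        other = right_by_name.get(attr.name)
        if other is None:
            continue  # name appears on one side only, so it is not a join key
        if other.lineage != attr.lineage:
            raise ValueError(
                f"attribute {attr.name!r} has lineage {attr.lineage!r} on the left "
                f"but {other.lineage!r} on the right; refusing to join"
            )
        matched.append(attr.name)
    return matched


# Illustrative tables (not taken from the paper).
session = [
    Attribute("subject_id", "Subject.subject_id"),
    Attribute("session_id", "Session.session_id"),
]
recording = [
    Attribute("subject_id", "Subject.subject_id"),
    Attribute("session_id", "Session.session_id"),
    Attribute("id", "Recording.id"),
]
stimulus = [Attribute("id", "Stimulus.id")]

print(join_attributes(session, recording))
```

In this sketch, joining `recording` with `stimulus` raises an error because both have an attribute named `id` with unrelated origins, which is the kind of accidental name collision a purely name-based join would accept.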
arXiv:2602.16585 (cs), Computer Science > Databases
Submitted on 18 Feb 2026
Title: DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows
Authors: Dimitri Yatsenko, Thinh T. Nguyen (DataJoint Inc., Houston, USA); listed as Dimitri Yatsenko and 3 other authors
Abstract: Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent workflow steps, rows represent artifacts, foreign keys prescribe execution order. The schema specifies not only what data exists but how it is derived -- a single formal system where data structure, computational dependencies, and integrity constraints are all queryable, enforceable, and machine-readable. Four technical innovations extend this foundation: object-augmented schemas integrating relational metadata with scalable object storage, semantic matching using attribute lineage to prevent erroneous joins, an extensible type system for domain-specific formats, and distributed job coordination designed for composability with external orchestration. By unifying data structure, data, and computational transformations, DataJoint ...
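The abstract's statement that foreign keys prescribe execution order can be sketched with the Python standard library alone. The snippet below is a minimal illustration under stated assumptions, not DataJoint code: the table names and the `FOREIGN_KEYS` mapping are hypothetical, and `graphlib.TopologicalSorter` stands in for whatever scheduling the system actually performs.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each table (workflow step) maps to the set of
# tables its foreign keys reference. Names are illustrative only.
FOREIGN_KEYS = {
    "Subject": set(),
    "Session": {"Subject"},
    "Recording": {"Session"},
    "SpikeSorting": {"Recording"},
    "Summary": {"Session", "SpikeSorting"},
}


def execution_order(foreign_keys):
    """Return tables ordered so that every referenced table comes first."""
    return list(TopologicalSorter(foreign_keys).static_order())


order = execution_order(FOREIGN_KEYS)
print(order)
```

Because the dependency graph is read directly from the foreign keys, the same declaration that enforces referential integrity also yields a valid order in which to compute each step; a cyclic set of references would raise `graphlib.CycleError`.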