[2602.16179] EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
Summary
The paper presents EnterpriseGym Corecraft, a high-fidelity reinforcement learning environment that simulates a customer support organization, designed to train AI agents whose capabilities generalize beyond the training distribution.
Why It Matters
This research highlights the importance of high-quality training environments in developing AI agents that can perform real-world tasks effectively. By demonstrating both in-distribution gains and transfer to out-of-distribution benchmarks, it contributes to the ongoing discourse on AI training methodologies and their applications in enterprise settings.
Key Takeaways
- EnterpriseGym Corecraft is a new RL environment for training AI agents.
- A trained model (GLM 4.6) improved from a 25.37% to a 36.76% task pass rate on held-out tasks after a single epoch of training in the environment.
- The study emphasizes the role of environment quality and realism in agent generalization.
- Task-centric design and expert-authored rubrics enhance training effectiveness.
- Results indicate potential applications in enterprise workflows and customer support.
arXiv:2602.16179 [cs.AI] — Submitted on 18 Feb 2026
Authors: Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen
Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI's suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% ...
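The abstract names the training algorithm (GRPO with adaptive clipping) without detail. As a rough illustration of the standard GRPO recipe, the sketch below computes group-relative advantages (rollout rewards normalized against their sampled group, with no value network) and a clipped surrogate term; the asymmetric clip bounds stand in for "adaptive clipping" and are illustrative assumptions, not values from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward
    against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token/step.
    The asymmetric eps_low/eps_high bounds loosely mimic adaptive
    clipping; these particular values are illustrative, not the paper's."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

In GRPO the per-group normalization replaces a learned critic, which is what makes it attractive for sparse, rubric-based rewards like Corecraft's pass/fail criteria: a group of rollouts on the same task supplies its own baseline.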