[2602.16179] EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
Summary
The paper presents EnterpriseGym Corecraft, a high-fidelity reinforcement learning environment that simulates a customer support organization, designed to train AI agents whose capabilities generalize beyond the training distribution.
Why It Matters
This research highlights the importance of high-quality training environments in developing AI agents that can perform real-world tasks effectively. By demonstrating both in-distribution gains and transfer to out-of-distribution benchmarks, it contributes to the ongoing discourse on AI training methodologies and their applications in enterprise settings.
Key Takeaways
- EnterpriseGym Corecraft is a new RL environment for training AI agents.
- A trained model (GLM 4.6) improved from a 25.37% to a 36.76% task pass rate on held-out tasks after a single epoch of training in the environment.
- The study emphasizes the role of environment quality and realism in agent generalization.
- Task-centric design and expert-authored rubrics enhance training effectiveness.
- Results indicate potential applications in enterprise workflows and customer support.
arXiv:2602.16179 [cs.AI] — Submitted on 18 Feb 2026
Authors: Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen
Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI's suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% ...
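The abstract names the training algorithm (GRPO with adaptive clipping) without detail. As a rough illustration of the standard GRPO recipe, the sketch below computes group-relative advantages (rollout rewards normalized against their sampled group, with no value network) and a clipped surrogate term; the asymmetric clip bounds stand in for "adaptive clipping" and are illustrative assumptions, not values from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward
    against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token/step.
    The asymmetric eps_low/eps_high bounds loosely mimic adaptive
    clipping; these particular values are illustrative, not the paper's."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

In GRPO the per-group normalization replaces a learned critic, which is what makes it attractive for sparse, rubric-based rewards like Corecraft's pass/fail criteria: a group of rollouts on the same task supplies its own baseline.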