[2604.00785] Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
Computer Science > Machine Learning
arXiv:2604.00785 (cs)
[Submitted on 1 Apr 2026]

Title: Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
Authors: Dharma Teja Vooturi, Dhiraj Kalamkar, Dipankar Das, Bharat Kaul

Abstract: Pretraining Large Language Models (LLMs) from scratch requires a massive amount of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel Ponte Vecchio (PVC) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of thousands of GPU tiles. To this end, we developed Optimus, an in-house training library with support for standard large-model training techniques. Using Optimus, we first pretrained Mula-1B, a 1-billion-parameter dense model, and Mula-7B-A1B, a 7-billion-parameter Mixture of Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three larger MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, to 100 billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed the compute scaling from 384 to 12288 GPU tiles and observed a scaling efficiency of around 90% at 12288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for...
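The model names encode total versus active parameters: Mula-7B-A1B, for instance, has roughly 7 billion parameters in total but activates only about 1 billion per token. The sketch below shows a generic top-k gated MoE feed-forward layer in PyTorch that illustrates this distinction; it is not the Optimus implementation, and all class and parameter names are illustrative assumptions.

    # Minimal sketch of a top-k gated MoE feed-forward layer (PyTorch).
    # Each token is routed to k of E experts, so per-token compute scales
    # with k while total parameter count scales with E. Names are illustrative,
    # not from the Optimus library described in the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (tokens, d_model)
            scores = self.router(x)                            # (tokens, E)
            weights, idx = torch.topk(scores, self.k, dim=-1)  # k experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                rows, slots = (idx == e).nonzero(as_tuple=True)
                if rows.numel() == 0:
                    continue  # no tokens routed to this expert
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
            return out

    moe = TopKMoE(d_model=64, d_ff=256, num_experts=8, k=2)
    y = moe(torch.randn(16, 64))  # only 2 of the 8 expert FFNs run per token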
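For readers unfamiliar with the metric, scaling efficiency is commonly defined as achieved throughput relative to ideal linear scaling from a base configuration; whether the paper uses exactly this definition is an assumption. A small worked example with purely illustrative throughput numbers:

    # Hedged sketch: one common definition of scaling efficiency, assuming
    # it is achieved throughput divided by ideal linear scaling from the
    # base configuration (the paper's exact definition may differ).
    def scaling_efficiency(base_tiles, base_tps, scaled_tiles, scaled_tps):
        ideal_tps = base_tps * (scaled_tiles / base_tiles)  # perfect linear scaling
        return scaled_tps / ideal_tps

    # Illustrative numbers only: going from 384 to 12288 tiles is a 32x
    # increase, so ~90% efficiency corresponds to about 28.8x the throughput.
    print(scaling_efficiency(384, 1.0e6, 12288, 28.8e6))  # -> 0.9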