[2505.18883] Partition Generative Modeling: Masked Modeling Without Masks
Summary
The paper introduces Partition Generative Models (PGMs), a novel approach to generative modeling that eliminates mask tokens, improving throughput and performance compared to existing masked generative models.
Why It Matters
This research addresses efficiency limitations of current masked generative models: at every sampling step they process the full sequence, including uninformative mask tokens. By eliminating those tokens while preserving parallel, any-order generation, PGMs offer a faster sampling path that could shape future generative modeling systems.
Key Takeaways
- PGMs replace masking with partitioning, allowing for efficient token generation.
- They achieve 5-5.5x higher throughput than MDLM on OpenWebText and a 7.5x throughput improvement over MaskGIT on ImageNet.
- PGMs maintain compatibility with existing MGM samplers and distillation methods.
Computer Science > Machine Learning
arXiv:2505.18883 (cs)
[Submitted on 24 May 2025 (v1), last revised 17 Feb 2026 (this version, v3)]
Title: Partition Generative Modeling: Masked Modeling Without Masks
Authors: Justin Deschenaux, Lan Tran, Caglar Gulcehre
Abstract: Masked generative models (MGMs) can generate tokens in parallel and in any order, unlike autoregressive models (ARMs), which decode one token at a time, left-to-right. However, MGMs process the full-length sequence at every sampling step, including mask tokens that carry no information. In contrast, ARMs process only the previously generated tokens. We introduce "Partition Generative Models" (PGMs), which replace masking with partitioning. Tokens are split into two groups that cannot attend to each other, and the model learns to predict each group conditioned on the other, eliminating mask tokens entirely. Because the groups do not interact, PGMs can process only the clean tokens during sampling, like ARMs, while retaining parallel, any-order generation, like MGMs. On OpenWebText, PGMs achieve $5-5.5\times$ higher throughput than MDLM while producing samples with lower Generative Perplexity. On ImageNet, PGMs reach comparable FID to MaskGIT with a $7.5\times$ throughput improvement. With twice as many steps, the FID improves to 4.56 while remaining $3...
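The core mechanism described in the abstract, two token groups that never attend to each other, can be pictured as a group-diagonal attention mask. Below is a minimal NumPy sketch of such a mask; the function name and 0/1 group encoding are illustrative assumptions, not details from the paper:

```python
import numpy as np

def partition_attention_mask(groups: np.ndarray) -> np.ndarray:
    """Boolean mask where entry (i, j) is True iff tokens i and j
    belong to the same partition, so attention never crosses groups."""
    g = np.asarray(groups)
    return g[:, None] == g[None, :]

# Toy example: 4 tokens with alternating group labels (hypothetical).
groups = np.array([0, 1, 0, 1])
mask = partition_attention_mask(groups)
```

In this toy case, tokens 0 and 2 may attend to each other, but never to tokens 1 and 3 (and vice versa), which is what lets the model predict one group from the other without introducing mask tokens.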