[2603.29029] MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.29029 (cs)
[Submitted on 30 Mar 2026]

Title: MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Authors: Bharath Krishnamurthy, Ajita Rattani

Abstract: Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply ...
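Because the abstract is truncated here, the exact fusion mechanism of the dual-stream block is not specified. The following is a minimal PyTorch sketch of one plausible reading, in the spirit of MM-DiT-style dual-stream designs: each stream (spatial mask/sketch tokens vs. semantic text tokens) keeps its own normalization, attention projections, and MLP, while self-attention runs over the concatenated token sequence so the two modalities exchange information inside every block. All names (`DualStreamBlock`, `qkv_spa`, `qkv_sem`, etc.) are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamBlock(nn.Module):
    """Hypothetical dual-stream transformer block (not the paper's code):
    separate per-modality weights, joint attention over both token sets."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # Per-stream parameters: spatial (mask/sketch) vs. semantic (text).
        self.norm1_spa, self.norm1_sem = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_spa = nn.Linear(dim, 3 * dim)
        self.qkv_sem = nn.Linear(dim, 3 * dim)
        self.proj_spa = nn.Linear(dim, dim)
        self.proj_sem = nn.Linear(dim, dim)
        self.norm2_spa, self.norm2_sem = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp_spa = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_sem = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _qkv(self, x: torch.Tensor, proj: nn.Linear):
        # Project to q, k, v and reshape to (B, heads, N, head_dim).
        B, N, _ = x.shape
        q, k, v = proj(x).chunk(3, dim=-1)
        shape = (B, N, self.num_heads, self.head_dim)
        return [t.view(shape).transpose(1, 2) for t in (q, k, v)]

    def forward(self, x_spa: torch.Tensor, x_sem: torch.Tensor):
        n_spa = x_spa.shape[1]
        q1, k1, v1 = self._qkv(self.norm1_spa(x_spa), self.qkv_spa)
        q2, k2, v2 = self._qkv(self.norm1_sem(x_sem), self.qkv_sem)
        # Joint attention over the concatenated spatial + semantic sequence,
        # so each modality can attend to the other.
        q = torch.cat([q1, q2], dim=2)
        k = torch.cat([k1, k2], dim=2)
        v = torch.cat([v1, v2], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).flatten(2)  # (B, N_spa + N_sem, dim)
        # Split back into the two streams and apply per-stream residuals.
        o_spa, o_sem = out[:, :n_spa], out[:, n_spa:]
        x_spa = x_spa + self.proj_spa(o_spa)
        x_sem = x_sem + self.proj_sem(o_sem)
        x_spa = x_spa + self.mlp_spa(self.norm2_spa(x_spa))
        x_sem = x_sem + self.mlp_sem(self.norm2_sem(x_sem))
        return x_spa, x_sem

# Usage: two token tensors in, two out, e.g.
# block = DualStreamBlock(dim=512, num_heads=8)
# x_spa, x_sem = block(torch.randn(2, 256, 512), torch.randn(2, 77, 512))
```

The design choice this sketch illustrates is the one the abstract emphasizes: rather than bolting a control module onto a frozen text-to-image pipeline, both modalities are first-class streams with their own parameters, and fusion happens through shared attention at every layer instead of a single late injection point.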