[2602.12304] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Summary
The paper introduces OmniCustom, a framework for synchronized audio-video customization that improves identity and timbre fidelity through a joint audio-video generation model.
Why It Matters
As video content creation becomes increasingly prevalent, the ability to customize audio and video in a synchronized manner is crucial for applications in entertainment, education, and virtual reality. OmniCustom's approach could significantly improve the quality and efficiency of multimedia production.
Key Takeaways
- OmniCustom synchronizes video identity and audio timbre using a joint generation model.
- The framework employs separate LoRA modules for identity and audio-timbre control, enabling independent customization of each.
- Contrastive learning objectives improve fidelity in generated content.
- Extensive experiments show OmniCustom outperforms existing audio-video generation methods.
- The model is trained on a large-scale dataset, ensuring high-quality outputs.
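The takeaways above mention contrastive learning objectives for improving fidelity. The digest does not spell out the exact loss, but a standard InfoNCE-style contrastive loss — where matched identity/timbre pairs sit on the diagonal of a similarity matrix — can be sketched as follows (the function name, batch setup, and temperature value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss: anchors[i] should match positives[i].

    Hypothetical sketch; the paper's actual objective may differ.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # matched pairs on the diagonal

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
low = info_nce_loss(emb, emb)                        # identical pairs -> low loss
high = info_nce_loss(emb, rng.normal(size=(4, 16)))  # mismatched pairs -> higher loss
```

Minimizing such a loss pulls embeddings of the same identity (or timbre) together while pushing mismatched pairs apart, which is one plausible way the reported fidelity gains could be realized.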
Computer Science > Sound
arXiv:2602.12304 (cs) [Submitted on 12 Feb 2026]
Title: OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Authors: Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu
Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate...
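The abstract attributes identity and timbre control to separate LoRA modules. As an illustrative sketch only (the paper's modules live inside a DiT backbone and are trained end to end), a LoRA layer augments a frozen weight matrix W with a trainable low-rank product B·A; the class name, rank, and initialization below are assumptions following common LoRA practice, not the paper's code:

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen base weight W plus a low-rank update B @ A.

    Hypothetical sketch of the LoRA pattern; rank r is much smaller than the
    layer width, so only r * (d_in + d_out) adapter parameters are trained.
    """
    def __init__(self, d_in, d_out, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # frozen base weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus scaled low-rank adapter path
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(8, 8)
x = np.ones((2, 8))
# With B zero-initialized, the adapter starts as a no-op: output equals the base path
out = layer(x)
```

Keeping identity and timbre in two such adapter sets, rather than one shared weight update, is what would let the two controls be customized independently, as the abstract describes.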