[2602.12304] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Summary
The paper introduces OmniCustom, a framework for synchronized audio-video customization that improves identity and timbre fidelity through a joint audio-video generation model.
Why It Matters
As video content creation becomes increasingly prevalent, the ability to customize audio and video in a synchronized manner is crucial for applications in entertainment, education, and virtual reality. OmniCustom's approach could significantly improve the quality and efficiency of multimedia production.
Key Takeaways
- OmniCustom synchronizes video identity and audio timbre using a joint generation model.
- The framework employs separate LoRA modules for identity and audio-timbre control, enabling independent customization of each.
- Contrastive learning objectives improve fidelity in generated content.
- Extensive experiments show OmniCustom outperforms existing audio-video generation methods.
- The model is trained on a large-scale dataset, ensuring high-quality outputs.
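The takeaways above mention contrastive learning objectives for improving fidelity. The digest does not spell out the exact loss, but a standard InfoNCE-style contrastive loss — where matched identity/timbre pairs sit on the diagonal of a similarity matrix — can be sketched as follows (the function name, batch setup, and temperature value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss: anchors[i] should match positives[i].

    Hypothetical sketch; the paper's actual objective may differ.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # matched pairs on the diagonal

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
low = info_nce_loss(emb, emb)                        # identical pairs -> low loss
high = info_nce_loss(emb, rng.normal(size=(4, 16)))  # mismatched pairs -> higher loss
```

Minimizing such a loss pulls embeddings of the same identity (or timbre) together while pushing mismatched pairs apart, which is one plausible way the reported fidelity gains could be realized.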
Computer Science > Sound
arXiv:2602.12304 (cs) [Submitted on 12 Feb 2026]
Title: OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Authors: Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu
Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate...
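The abstract attributes identity and timbre control to separate LoRA modules. As an illustrative sketch only (the paper's modules live inside a DiT backbone and are trained end to end), a LoRA layer augments a frozen weight matrix W with a trainable low-rank product B·A; the class name, rank, and initialization below are assumptions following common LoRA practice, not the paper's code:

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen base weight W plus a low-rank update B @ A.

    Hypothetical sketch of the LoRA pattern; rank r is much smaller than the
    layer width, so only r * (d_in + d_out) adapter parameters are trained.
    """
    def __init__(self, d_in, d_out, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # frozen base weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus scaled low-rank adapter path
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(8, 8)
x = np.ones((2, 8))
# With B zero-initialized, the adapter starts as a no-op: output equals the base path
out = layer(x)
```

Keeping identity and timbre in two such adapter sets, rather than one shared weight update, is what would let the two controls be customized independently, as the abstract describes.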