[2503.23377] JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Summary

The paper presents JavisDiT, a novel Joint Audio-Video Diffusion Transformer that enhances synchronized audio-video generation through a hierarchical spatio-temporal alignment mechanism.

Why It Matters

JavisDiT addresses the challenge of generating synchronized audio and video content, which is crucial for applications in media production, gaming, and virtual reality. The paper also introduces a new benchmark, JavisBench, which gives researchers a common basis for evaluating synchronization in diverse and complex real-world scenarios.

Key Takeaways

  • JavisDiT uses a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator, which extracts both global and fine-grained spatio-temporal priors to guide audio-video synchronization.
  • The model significantly outperforms existing methods in generating high-quality synchronized audio-video content.
  • JavisBench, a new benchmark dataset, consists of 10,140 high-quality text-captioned sounding videos for synchronization evaluation.
  • A robust metric for measuring synchrony between generated audio and video pairs has been developed.
  • By generating audio and video jointly from open-ended user prompts in a unified framework, the research advances generative AI for multimedia applications.

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.23377 (cs)
[Submitted on 30 Mar 2025 (v1), last revised 22 Feb 2026 (this version, v2)]

Title: JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Jiebo Luo, Ziwei Liu, Hao Fei, Tat-Seng Chua

Abstract: This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Based on the powerful Diffusion Transformer (DiT) architecture, JavisDiT simultaneously generates high-quality audio and video content from open-ended user prompts in a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios. Further, we specifically ...
