[2602.20981] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

arXiv - AI 4 min read Article

Summary

This paper presents MMHNet, a multimodal hierarchical network that extends existing video-to-audio generation models so that a model trained on short instances can generate much longer audio at test time. The authors report improved long-form generation for audio of more than 5 minutes.

Why It Matters

As the demand for high-quality audio generation from video content increases, this research addresses a critical challenge in the field of multimodal AI. By demonstrating that models can effectively generalize from short to long audio, it opens new avenues for applications in media production, accessibility, and entertainment.

Key Takeaways

  • MMHNet significantly improves long audio generation capabilities.
  • The model can generalize from short training instances to longer audio outputs.
  • Achieves state-of-the-art results in video-to-audio benchmarks, outperforming previous methods.

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.20981 (cs)
[Submitted on 24 Feb 2026]

Title: Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkabl...
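The abstract mentions a "non-causal" sequence model but does not say how it is built. As a rough illustration of what non-causal means, the sketch below runs a toy scalar linear recurrence in both directions and combines the results, so each output position can see inputs on both sides. This is an assumption-laden toy, not MMHNet and not Mamba: the function names `causal_scan` and `non_causal_scan` and the constants `a` and `b` are hypothetical, and a real Mamba layer uses input-dependent, learned parameters.

```python
def causal_scan(xs, a=0.9, b=1.0):
    """Toy linear recurrence h_t = a * h_{t-1} + b * x_t, scanned left to right.

    Hypothetical helper for illustration only; output t depends on x_0..x_t.
    """
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def non_causal_scan(xs, a=0.9, b=1.0):
    """Combine a forward scan and a backward scan over the same sequence.

    Assumed construction (one common way to remove causality), not the
    paper's confirmed design. Output t depends on the whole sequence.
    """
    forward = causal_scan(xs, a, b)
    backward = causal_scan(xs[::-1], a, b)[::-1]
    # Both directions include the current step, so subtract b * x_t once.
    return [f + r - b * x for f, r, x in zip(forward, backward, xs)]


# A single impulse in the middle of a short sequence.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
causal_out = causal_scan(impulse)          # zero before the impulse
non_causal_out = non_causal_scan(impulse)  # non-zero on both sides of it
```

With the causal scan, positions before the impulse stay at zero because they cannot see the future. With the two-direction version, the response spreads symmetrically around the impulse.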
