[2603.02655] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Computer Science > Computation and Language
arXiv:2603.02655 (cs)
[Submitted on 3 Mar 2026]

Title: Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Authors: Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki

Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos, supporting accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of ...
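The two scheduling strategies described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the character-rate duration heuristic, the `min_gap` parameter, and all function names are hypothetical assumptions.

```python
def estimate_duration(text: str, chars_per_sec: float = 8.0) -> float:
    # Hypothetical heuristic: estimated speaking time grows linearly
    # with utterance length (the paper's actual estimator is unspecified).
    return len(text) / chars_per_sec

def fixed_interval_schedule(start: float, interval: float, n: int) -> list[float]:
    # Fixed-interval decoding: query the MLLM for a new comment
    # every `interval` seconds, regardless of utterance length.
    return [start + i * interval for i in range(n)]

def dynamic_interval_schedule(start: float, utterances: list[str],
                              min_gap: float = 0.5) -> list[float]:
    # Dynamic interval-based decoding: the next prediction time is
    # deferred by the estimated spoken duration of the previous
    # utterance, so generated commentary does not overlap itself.
    times, t = [], start
    for u in utterances:
        times.append(t)
        t += estimate_duration(u) + min_gap
    return times
```

With this sketch, a long previous utterance pushes the next generation step further into the future, while the fixed-interval baseline keeps a constant cadence.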