[2603.03505] PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.03505 (cs)
[Submitted on 3 Mar 2026]

Title: PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

Authors: Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

Abstract: State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than from model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles such as object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8% joint success on VideoPhy2 (an 8.6pp gain), improving physical commonsense by 1...
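The abstract's second stage combines two ideas: group-relative advantage normalization (GRPO) and a reward curriculum that shifts weight from semantic fidelity to physical commonsense over training. The paper does not publish its exact formulas here, so the following is a minimal illustrative sketch under common assumptions: a linear curriculum schedule, a convex blend of the two reward terms, and per-group standardization of rewards into advantages. All function names are hypothetical.

```python
import math

def curriculum_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 0 at the start (all semantic
    fidelity), 1 at the end (all physical commonsense)."""
    return min(1.0, max(0.0, step / total_steps))

def blended_reward(sem: float, phys: float, step: int, total_steps: int) -> float:
    """Convex blend of the two reward terms under the curriculum weight."""
    w = curriculum_weight(step, total_steps)
    return (1.0 - w) * sem + w * phys

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each sampled refinement's
    reward against the mean/std of its own group, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std < 1e-8:  # degenerate group: all rewards equal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: a group of 4 refined prompts scored mid-training (w = 0.5),
# so semantic and physics rewards count equally before normalization.
group = [(0.9, 0.2), (0.7, 0.8), (0.8, 0.5), (0.6, 0.9)]
blended = [blended_reward(s, p, step=50, total_steps=100) for s, p in group]
advantages = grpo_advantages(blended)
```

Early in training, prompts that drift from the user's request are penalized most; late in training, physically implausible refinements are, which matches the abstract's "initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense."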