[2510.18573] Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.18573 (cs)
[Submitted on 21 Oct 2025 (v1), last revised 4 Mar 2026 (this version, v2)]
Title: Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Authors: Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, Meng Wang
Abstract: We present Kaleido, a subject-to-video (S2V) generation framework that synthesizes subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at disentangling subjects from their backgrounds, often resulting in low reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training data lack diversity, high-quality samples, and cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal and can lead to confusion between subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filte...
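The "cross-paired data" idea from the abstract, where a reference crop and its target clip come from different instances of the same subject so the model cannot simply copy the reference background, can be illustrated with a minimal sketch. Note this is a hypothetical reconstruction of the data-pairing step, not the paper's actual pipeline; all names (`build_cross_pairs`, the `subject_id`/`clip_id` fields) are illustrative assumptions.

```python
# Hypothetical sketch of cross-paired sample construction for S2V training:
# pair a reference crop from one clip with a target clip of the *same subject*
# taken from a *different* instance, decoupling subject identity from background.
import random
from collections import defaultdict

def build_cross_pairs(samples, seed=0):
    """samples: list of dicts with 'subject_id', 'clip_id',
    'ref_crop', 'target_clip'. Returns (ref, target) pairs whose
    components come from different clips of the same subject."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for s in samples:
        by_subject[s["subject_id"]].append(s)
    pairs = []
    for subject, group in by_subject.items():
        if len(group) < 2:
            continue  # need at least two instances to cross-pair
        for s in group:
            # donate the reference crop from a different clip of the same subject
            donor = rng.choice([g for g in group if g["clip_id"] != s["clip_id"]])
            pairs.append({"subject_id": subject,
                          "ref": donor["ref_crop"],
                          "target": s["target_clip"]})
    return pairs
```

Subjects with only one available clip are skipped, since a same-instance pair would reintroduce the background shortcut the cross-pairing is meant to remove.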