CI-VID is a large-scale dataset of coherent, interleaved text-video sequences for multi-clip video generation. It contains more than 340,000 interleaved sequences of video clips and rich captions (334k train / 8k test) and supports text-and-video-to-video (TV2V) generation. Unlike traditional text-to-video (T2V) datasets built from isolated clip-caption pairs, CI-VID captures both intra-clip content and inter-clip transitions, enabling story-driven generation with temporal and visual coherence. The full video data totals ~6 TB.
dataset, generation, multimodal
