Native multimodal audio-video joint generation model accepting four input modalities (text, image, audio, video) and producing 4–15 second clips at 480p and 720p with synchronized audio. Supports multi-reference composition: up to 3 video clips, 9 images, and 3 audio clips in a single generation, enabling multi-shot narrative storytelling. Also released as a Fast variant for low-latency scenarios.
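The reference and clip limits quoted above can be checked before a generation job is submitted. The following is a minimal sketch only: `GenerationRequest`, `validate`, and every field name are hypothetical illustrations, not the model's actual API. Only the numeric limits (3 video clips, 9 images, 3 audio clips, 4 to 15 seconds, 480p or 720p) come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Limits taken from the description above.
MAX_VIDEO_REFS = 3
MAX_IMAGE_REFS = 9
MAX_AUDIO_REFS = 3
MIN_DURATION_S = 4
MAX_DURATION_S = 15
RESOLUTIONS = ("480p", "720p")


@dataclass
class GenerationRequest:
    """Hypothetical request shape (an assumption, not the real API)."""

    prompt: str
    duration_s: int
    resolution: str
    video_refs: list[str] = field(default_factory=list)
    image_refs: list[str] = field(default_factory=list)
    audio_refs: list[str] = field(default_factory=list)


def validate(req: GenerationRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems: list[str] = []
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        problems.append(
            f"duration {req.duration_s}s is outside {MIN_DURATION_S}-{MAX_DURATION_S}s"
        )
    if req.resolution not in RESOLUTIONS:
        problems.append(f"resolution {req.resolution!r} is not one of {RESOLUTIONS}")
    # Check each reference modality against its own cap.
    for name, refs, cap in (
        ("video", req.video_refs, MAX_VIDEO_REFS),
        ("image", req.image_refs, MAX_IMAGE_REFS),
        ("audio", req.audio_refs, MAX_AUDIO_REFS),
    ):
        if len(refs) > cap:
            problems.append(f"{len(refs)} {name} references exceed the limit of {cap}")
    return problems


# Example: a multi-shot request that stays within every limit.
request = GenerationRequest(
    prompt="Two-shot scene: a street musician, then the crowd reacting",
    duration_s=12,
    resolution="720p",
    video_refs=["shot_a.mp4", "shot_b.mp4"],
    image_refs=["character.png"],
    audio_refs=["melody.wav"],
)
issues = validate(request)
```

Returning a list of problems rather than raising on the first one lets a caller report every violated limit at once.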

Uses a unified large-scale architecture for joint audio-visual generation. In expert evaluations, performance is on par with leading video generation models, with substantial improvements reported across all key sub-dimensions. Seedance 1.5 Pro (Dec 2025) introduced native audio-visual joint generation; Seedance 2.0 generalizes this to full multimodal, multi-reference control.


generation, video, audio, multimodal

Related