Multimodal diffusion model with representation alignment for high-fidelity Foley audio generation synchronized with video content.

audio, video, generation