AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

AudioCALM is a universal audio-generation framework that unifies speech, sound, and music in a single model. It extends autoregressive next-token prediction from discrete tokens to continuous audio latents: a thin flow-matching head replaces the softmax and predicts rectified-flow velocities at each position, while a block-causal AR-Flow attention pattern enables arbitrary-length output. An Asymmetric Mixture-of-Modality-Experts (A-MoME) design handles the asymmetry in text–audio alignment between speech and sound/music tasks.
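To make the two mechanisms named above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the block size, the choice of full visibility inside a block, and the time convention (noise at t = 0, data latent at t = 1) are all assumptions made for illustration; the paper may use different conventions.

```python
def block_causal_mask(num_positions, block_size):
    """Return mask[i][j] = True when position i may attend to position j.

    Assumption: positions are grouped into consecutive blocks of `block_size`,
    and a position sees its whole block plus every earlier block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_positions)]
        for i in range(num_positions)
    ]


def rectified_flow_pair(noise, target, t):
    """Build one rectified-flow training pair for a single latent vector.

    Assumption: x_t moves on a straight line from noise (t=0) to the target
    latent (t=1), so the regression target is the constant velocity
    target - noise.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    velocity = [x - n for n, x in zip(noise, target)]
    return x_t, velocity


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch, `velocity_fn` stands in for the flow-matching head conditioned on the backbone's hidden state at one position. Because attention never reaches a later block, generation can proceed block by block, which is one way to read the claim of arbitrary-length output.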

AudioCALM is built on a Qwen3-1.7B backbone and was trained for 300k steps on 8× A800 GPUs. It ranks first or second on every evaluated metric across all three modalities: LibriTTS speech (WER 0.020, SIM 0.668), AudioCaps sound (FAD 1.95), and Song-Describer music (FAD 2.02). The work is by the Tongyi Fun Team at Alibaba (corresponding author Qian Chen, with Wen Wang, Bin Ma, and Xiangang Li) in collaboration with HKUST (lead author Huadai Liu, with Wei Xue).

Authors: Huadai Liu · Kaicheng Luo · Wen Wang · Qian Chen · Bin Ma · Xiangang Li · Wei Xue

Tags: audio · speech · music · generation · research