Open-source end-to-end audio large language model integrating speech understanding and generation for real-time bilingual Chinese-English dialogue. Uses multi-codebook discretization at a 12.5 Hz frame rate to retain both semantic and acoustic information. Achieves 3.2% WER on the Fleurs zh test set, compared with 12.4% for Whisper-large-v3. Includes the OpenAudio-Bench evaluation benchmark.
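
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of how word error rate is commonly computed: the token-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The function name `word_error_rate` is illustrative and is not taken from the Baichuan-Audio code; the tokenization used for the reported Fleurs zh scores is not stated here, so the character-level split shown in the usage note is only an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Token-level Levenshtein distance divided by the reference length.

    `reference` and `hypothesis` are sequences of tokens (words or characters).
    This is a generic sketch, not the evaluation script behind the reported scores.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(reference)


if __name__ == "__main__":
    # One substitution and one deletion against a four-token reference.
    print(word_error_rate("a b c d".split(), "a x c".split()))
```

For Chinese text, which has no whitespace word boundaries, a common choice is to pass character lists instead, for example `word_error_rate(list(ref_text), list(hyp_text))`.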

Outputs 2

Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

paper

Technical paper describing the multi-codebook speech discretization, text-guided aligned speech generation, and two-stage pre-training strategy.
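
As a rough illustration of what the 12.5 Hz frame rate stated in the description implies, the sketch below converts an audio duration into a frame count and a total discrete-token count. The number of codebooks is not given in this listing, so `num_codebooks` is a hypothetical parameter, and the function names are illustrative only.

```python
import math

FRAME_RATE_HZ = 12.5  # frame rate stated in the model description


def frame_duration_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Duration of audio covered by one frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def token_count(duration_s, num_codebooks, frame_rate_hz=FRAME_RATE_HZ):
    """Return (frames, total_tokens) for a clip of `duration_s` seconds.

    Assumes one token per codebook per frame; `num_codebooks` is a
    hypothetical value chosen by the caller, not a figure from the listing.
    """
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames, frames * num_codebooks


if __name__ == "__main__":
    print(frame_duration_ms())                    # milliseconds per frame
    print(token_count(10.0, num_codebooks=8))     # 8 is an assumed codebook count
```

Under these assumptions each frame spans 80 ms, so a low frame rate keeps the token sequence short while the extra codebooks carry the acoustic detail.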

arXiv: 2502.17239

Baichuan-Audio (model)

model

Open-source end-to-end speech interaction model with Base and Instruct variants for bilingual Chinese-English audio dialogue.

Architecture DENSE
Parameters 10B
open-weight, audio, nlp