Baichuan-Audio | Lab Index

Open-source end-to-end audio large language model that integrates speech understanding and generation for real-time bilingual (Chinese-English) dialogue. Uses multi-codebook discretization at a 12.5 Hz frame rate to retain both semantic and acoustic information. Reports 3.2% WER on the Fleurs zh test set, compared with 12.4% for Whisper-large-v3. Includes the OpenAudio-Bench evaluation benchmark.
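
For readers who want to sanity-check the figures above, the following is a minimal sketch, not the project's evaluation code. It shows the standard error-rate definition (edit distance divided by reference length), the frame duration implied by a 12.5 Hz frame rate, and the relative error reduction implied by the two reported WER values. All function names are hypothetical, and scoring Chinese text per character is an assumption; the entry does not say how the benchmark tokenizes its references.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def frame_duration_ms(frame_rate_hz):
    """Duration of one discrete audio frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the `baseline` error rate."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # English: whitespace tokens. Chinese: characters (assumed convention).
    print(error_rate("the cat sat".split(), "the cat sit".split()))
    print(error_rate(list("今天天气好"), list("今天天汽好")))
    # 12.5 Hz frame rate from the description above.
    print(frame_duration_ms(12.5))
    # WER figures quoted above: 12.4% (Whisper-large-v3) vs 3.2%.
    print(round(relative_reduction(12.4, 3.2), 3))
```

Under these definitions, a 12.5 Hz frame rate corresponds to one frame every 80 ms, and going from 12.4% to 3.2% WER is roughly a 74% relative reduction in errors.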

Paper (arXiv) | GitHub | HuggingFace (Base) | HuggingFace (Instruct)

Outputs 2

Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

paper

Technical paper describing the multi-codebook speech discretization, text-guided aligned speech generation, and two-stage pre-training strategy.

Paper (arXiv)

arXiv HTML

Baichuan-Audio (model)

model

Open-source end-to-end speech interaction model with Base and Instruct variants for bilingual Chinese-English audio dialogue.

HuggingFace (Base) | HuggingFace (Instruct) | GitHub

Architecture DENSE

Parameters 10B

open-weight | audio | nlp