End-to-end trained omni-modal model supporting text, image, video, and audio input with text and audio output. Built on Qwen2.5-7B, it uses a custom audio tokenizer (Baichuan-Audio-Tokenizer) and a multi-stage training strategy over approximately 500B high-quality tokens. It outperforms contemporary models, including GPT-4o-mini, in omni-modal capabilities and achieves results comparable to Qwen2-VL-72B on multimodal medical benchmarks.
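
As a rough usage illustration, the sketch below shows the generic Hugging Face loading pattern common to Baichuan releases. It is a minimal sketch, not the official API: the repo id, the text-only prompt path, and the generation settings are assumptions; image, video, and audio preprocessing is handled by model-specific code shipped with the release and is not shown here.

```python
# Minimal text-only usage sketch (assumed repo id; not the official example).
# Baichuan releases typically ship custom modeling code, hence trust_remote_code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "baichuan-inc/Baichuan-Omni-1.5"  # assumption: actual repo id may differ

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a 7B model on one GPU
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Describe the key components of an omni-modal model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```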

Outputs (2)

Baichuan-Omni-1.5 Technical Report

paper

Technical report describing the omni-modal architecture, data-cleaning pipeline, audio tokenizer, and multi-stage training strategy (a generic tokenizer sketch follows this entry).

arXiv: 2501.15368
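
To illustrate what an audio tokenizer of this kind does, here is a generic residual-vector-quantization (RVQ) sketch, the technique commonly used to turn continuous audio-frame embeddings into discrete tokens. This is illustrative only: the actual Baichuan-Audio-Tokenizer design (codebook sizes, number of stages, frame rate) is specified in the report, and the `rvq_encode` helper, shapes, and toy data below are all hypothetical.

```python
# Generic RVQ encoder sketch: each codebook stage quantizes the residual
# left by the previous stage, yielding one code index per stage per frame.
import numpy as np

def rvq_encode(frames, codebooks):
    """frames: (T, D) frame embeddings; codebooks: list of (K, D) arrays.
    Returns an (num_stages, T) array of discrete code indices."""
    residual = frames.copy()
    codes = []
    for cb in codebooks:
        # nearest codeword for each frame's current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = dists.argmin(axis=1)                                      # (T,)
        codes.append(idx)
        residual = residual - cb[idx]  # pass what this stage missed onward
    return np.stack(codes, axis=0)

# Toy usage: 10 frames, 64-dim embeddings, 4 stages of 256-entry codebooks.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 64)).astype(np.float32)
codebooks = [rng.normal(size=(256, 64)).astype(np.float32) for _ in range(4)]
print(rvq_encode(frames, codebooks).shape)  # (4, 10)
```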

Baichuan-Omni-1.5 (model)

model

Open-source omni-modal model accepting text, image, video, and audio input and producing text and audio output, built on Qwen2.5-7B.

Architecture DENSE
Parameters 7B
open-weight · multimodal · audio · vision