Baichuan-Omni-1.5
paper · model
End-to-end trained omni-modal model supporting text, image, video, and audio input with text and audio output. Built on Qwen2.5-7B, it uses a custom audio tokenizer (Baichuan-Audio-Tokenizer) and a multi-stage training strategy over approximately 500B high-quality tokens. It leads contemporary models, including GPT-4o-mini, in omni-modal capabilities and achieves results comparable to Qwen2-VL-72B on multimodal medical benchmarks.
Outputs 2
Baichuan-Omni-1.5 Technical Report
paper
Technical report describing the omni-modal architecture, data-cleaning pipeline, audio tokenizer, and multi-stage training strategy.
arXiv: 2501.15368
Baichuan-Omni-1.5 (model)
model
Open-source omni-modal model with text, image, video, and audio input and text/audio output, built on Qwen2.5-7B.
Architecture DENSE
Parameters 7B