A native multimodal foundation model (74B-parameter MoE, ~3B active) that unifies text, vision, and audio understanding and generation via DiNA (Discrete Native Autoregression). It introduces dNaViT for dynamic visual tokenization, achieving 28x visual compression with strong text rendering, and supports image generation, TTS, voice cloning, and low-latency voice conversation.

Model Details

Architecture: MoE
Parameters: 74B
Active parameters: 3B
Tags: multimodal, moe, open-weight, any-to-any, audio, vision