A native multimodal foundation model (74B-parameter MoE, ~3B active) that unifies text, vision, and audio understanding and generation via DiNA (Discrete Native Autoregression). It introduces dNaViT for dynamic visual tokenization, achieving 28x visual compression with strong text rendering, and supports image generation, TTS, voice cloning, and low-latency voice conversation.

Model Details

Architecture: MoE
Parameters: 74B
Active parameters: 3B
Tags: multimodal, moe, open-weight, any-to-any, audio, vision