A 236B-parameter Mixture-of-Experts model (21B parameters active per token) that introduced Multi-head Latent Attention (MLA); a sketch of the MLA idea follows the model entry below. Accompanied by a technical report.

Outputs (2)

DeepSeek-V2 (model)

Architecture: MoE
Parameters: 236B
Active parameters: 21B
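
As referenced in the description above, here is a minimal sketch of the MLA idea: instead of caching full per-head keys and values, each token caches one small latent vector that is up-projected into keys and values at attention time, which shrinks the KV cache. All module names and dimensions here are illustrative assumptions, not DeepSeek-V2's actual configuration; the paper's decoupled RoPE branch and causal masking are omitted for brevity.

# Minimal sketch of Multi-head Latent Attention (MLA), assuming PyTorch.
# Names and dimensions are illustrative, not DeepSeek-V2's real config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Compress each token's hidden state to one small latent vector.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Expand the cached latent back into per-head keys and values.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        B, T, D = x.shape
        latent = self.kv_down(x)                        # (B, T, d_latent)
        if latent_cache is not None:                    # extend the cache
            latent = torch.cat([latent_cache, latent], dim=1)
        S = latent.size(1)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)   # (B, heads, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, D)
        # Only `latent` is cached: d_latent floats per token, versus
        # 2 * n_heads * d_head for conventional multi-head KV caching.
        return self.out_proj(out), latent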

DeepSeek-V2 Technical Report (paper)

Technical report detailing the Multi-head Latent Attention and DeepSeekMoE architecture innovations; a sketch of the DeepSeekMoE idea follows the tags below.

arXiv: 2405.04434

Tags: moe, frontier, open-weight
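
For context on the second innovation named in the report description: DeepSeekMoE replaces a handful of large experts with many fine-grained routed experts plus a few shared experts that every token always passes through. The following is a minimal sketch of that routing pattern under assumed, illustrative sizes; the naive dispatch loop, expert counts, and gating are not the paper's implementation, and its load-balancing losses are omitted.

# Minimal sketch of DeepSeekMoE-style routing, assuming PyTorch.
# Sizes, counts, and dispatch are illustrative assumptions only.
import torch
import torch.nn as nn

def make_expert(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                         nn.Linear(d_ff, d_model))

class DeepSeekMoESketch(nn.Module):
    def __init__(self, d_model=1024, d_ff=256, n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        # Shared experts: every token passes through all of them.
        self.shared = nn.ModuleList(make_expert(d_model, d_ff)
                                    for _ in range(n_shared))
        # Fine-grained routed experts: each token uses only top_k of them.
        self.routed = nn.ModuleList(make_expert(d_model, d_ff)
                                    for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):                       # x: (tokens, d_model)
        out = sum(expert(x) for expert in self.shared)
        weights = self.gate(x).softmax(dim=-1)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):             # naive per-slot dispatch
            for eid in top_i[:, k].unique():
                sel = top_i[:, k] == eid        # tokens routed to expert eid
                routed_out[sel] += top_w[sel, k, None] * self.routed[int(eid)](x[sel])
        return out + routed_out

This routing pattern is what lets a 236B-parameter model activate only about 21B parameters per token: each token runs the shared path plus its top_k routed experts, while the remaining experts stay idle.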