MiniCPM
model · paper
The original MiniCPM series, which proved that small models can rival much larger ones. This work introduced the **Warmup-Stable-Decay (WSD)** learning rate scheduler, which popularized the concept of **midtraining** (also called annealing): the learning rate is held high for a long "stable" phase and decayed only in the final ~10% of training, during which high-quality data is introduced. With this recipe, the 2.4B model achieved performance parity with 7B-13B models. The scheduler also enables continuous training and efficient scaling-law research without pre-defined token budgets.
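The three WSD phases described above can be sketched as a simple step-to-learning-rate function. This is a minimal illustration, not the paper's exact recipe: the function name and hyperparameter values are assumptions, and the decay shape here is linear for clarity (the original work explores other shapes, e.g. exponential decay).

```python
def wsd_lr(step, total_steps, peak_lr=0.01,
           warmup_steps=100, min_lr=1e-4, decay_frac=0.10):
    """Warmup-Stable-Decay: linear warmup, constant plateau,
    then decay over the final `decay_frac` of training."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate; training can be
        # extended from any checkpoint in this phase, which is what
        # makes scaling-law studies cheap (no fixed token budget).
        return peak_lr
    # Decay: anneal from peak_lr down to min_lr over the last 10%,
    # the phase where high-quality data is mixed in.
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * frac
```

Because the stable phase is flat, a run decayed at 90% of one budget and a run continued past that point share the same trajectory up to the decay start, so one long stable run can be branched into many annealed checkpoints.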
Outputs 3
MiniCPM-1B / 2B
model
Architecture DENSE
Variants
| Name | Parameters | Notes |
|---|---|---|
| MiniCPM-1B | 1B | — |
| MiniCPM-2B | 2B | — |
MiniCPM: Unveiling the Potential of End-Side Large Language Models
paper
arXiv: 2404.06395
MiniCPM-MoE-8x2B
model
MoE version delivering 7B-class performance with significantly lower active-parameter costs.
Architecture MOE