MiniCPM4
Ultra-efficient 8B LLM for end devices. Introduces Hybrid Reasoning (dual-mode switching between deep reasoning and fast response) and InfLLM-v2 sparse attention, in which each token computes relevance against fewer than 5% of the tokens, reducing attention computation by 60%. Achieves a 3-7x generation speedup on edge chips.
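To make the sparse-attention idea concrete, here is a minimal NumPy sketch of top-k sparse attention, where each query keeps only the top ~5% of key scores before the softmax. This is a token-level simplification for illustration only; the actual InfLLM-v2 mechanism selects at a coarser granularity and uses its own kernels, and the `topk_sparse_attention` function and all parameters here are hypothetical.

```python
import numpy as np

def topk_sparse_attention(q, k, v, frac=0.05):
    """Each query attends only to its top `frac` fraction of keys.

    Hypothetical sketch of the <5% relevance idea, not the real
    InfLLM-v2 kernel (which works at block granularity).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (Tq, Tk) dot-product scores
    keep = max(1, int(frac * k.shape[0]))              # keys kept per query (~5%)
    thresh = np.sort(scores, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)  # drop all other keys
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over kept keys only
    return w @ v

rng = np.random.default_rng(0)
T, d = 200, 16
q, k, v = rng.standard_normal((3, T, d))
out = topk_sparse_attention(q, k, v)
print(out.shape)
```

With `frac=0.05` and 200 tokens, each query mixes only ~10 value vectors instead of all 200, which is the source of the claimed attention-compute reduction.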