Ultra-efficient 8B LLM for end devices. Introduces Hybrid Reasoning (dual-mode switching between deep reasoning and fast response) and InfLLM-v2 sparse attention, in which each token computes relevance scores against fewer than 5% of the other tokens, cutting attention computation by 60%. Achieves a 3-7x generation speedup on edge chips.
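The sparse-attention idea described above (each token attending to under 5% of tokens) can be illustrated with a toy top-k attention in NumPy. This is a minimal sketch of the general top-k sparsification pattern, not the actual InfLLM-v2 algorithm; the function name and 5% keep fraction are illustrative assumptions.

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep_frac=0.05):
    """Toy sparse attention: each query attends only to the keep_frac
    fraction of keys with the highest relevance scores.
    Illustrative sketch only -- NOT the InfLLM-v2 implementation."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n_q, n_k) relevance
    n_keep = max(1, int(np.ceil(keep_frac * k.shape[0])))
    # Indices of each query's top-n_keep keys.
    idx = np.argpartition(scores, -n_keep, axis=-1)[:, -n_keep:]
    # Mask out everything outside the top-n_keep set.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)
    masked = scores + mask
    # Softmax over the surviving ~5% of positions; the rest get weight 0.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (n_q, d_v)

rng = np.random.default_rng(0)
n, d = 200, 32
q, k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = topk_sparse_attention(q, k, v)
print(out.shape)
```

With `keep_frac=0.05`, each of the 200 queries mixes values from only its 10 highest-scoring keys, which is where the claimed reduction in attention compute comes from.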

Outputs 2

MiniCPM4-8B

model
Architecture DENSE
Parameters 8B

MiniCPM4: Ultra-Efficient LLMs on End Devices

paper

arXiv: 2506.07900

on-device · efficiency · reasoning · open-weight
