Mistral 7B
Mistral's debut model and a landmark in efficient open-weight LLMs. 7.3B dense parameters with two key architectural innovations: Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) for efficient long-context handling. 32K context window.
Outperformed Llama 2 13B across all benchmarks and Llama 1 34B on reasoning, math, and code (MMLU: 60.1%, HellaSwag: 84.0%). Released under the Apache 2.0 license, it spawned an enormous ecosystem of fine-tunes and derivatives across the open-source community.
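To make the two innovations above concrete, here is a minimal NumPy sketch combining them: each group of query heads shares one key/value head (GQA), and each position attends only to itself and the previous `window - 1` tokens (SWA). All names, shapes, and parameters here are illustrative assumptions for exposition, not Mistral's actual implementation or hyperparameters.

```python
import numpy as np

def gqa_sliding_attention(x, Wq, Wk, Wv, n_heads, n_kv_heads, window):
    """Grouped-Query Attention with a sliding-window causal mask (sketch).

    Hypothetical shapes: x is (seq, d_model); Wq is (d_model, n_heads*d_head);
    Wk and Wv are (d_model, n_kv_heads*d_head). Each group of
    n_heads // n_kv_heads query heads shares one K/V head, which shrinks the
    KV cache; the window mask bounds per-token attention cost.
    """
    seq, _ = x.shape
    d_head = Wq.shape[1] // n_heads
    group = n_heads // n_kv_heads

    q = (x @ Wq).reshape(seq, n_heads, d_head)
    k = (x @ Wk).reshape(seq, n_kv_heads, d_head)
    v = (x @ Wv).reshape(seq, n_kv_heads, d_head)

    # Sliding-window causal mask: position i sees positions (i-window, i].
    i = np.arange(seq)[:, None]
    j = np.arange(seq)[None, :]
    mask = (j <= i) & (j > i - window)

    out = np.zeros((seq, n_heads, d_head))
    for h in range(n_heads):
        kv = h // group  # the single K/V head shared by this query group
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(d_head)
        scores = np.where(mask, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, n_heads * d_head)

# Toy usage with made-up dimensions: 4 query heads sharing 2 KV heads, window of 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
Wq = rng.normal(size=(16, 16))
Wk = rng.normal(size=(16, 8))
Wv = rng.normal(size=(16, 8))
out = gqa_sliding_attention(x, Wq, Wk, Wv, n_heads=4, n_kv_heads=2, window=3)
```

Because tokens beyond the window are reached transitively through stacked layers, SWA extends the effective receptive field far past the per-layer window while keeping per-token attention cost constant.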
Model Details
Architecture: Dense
Parameters: 7.3B
Context window: 32,000 tokens
Paper: arXiv:2310.06825