DeepSeek

2026-04-24

DeepSeek's most powerful model family and the first frontier-scale model trained entirely on Huawei Ascend 950PR chips — zero NVIDIA/CUDA depend…

↓ 6.8M

Engram: Conditional Memory via Scalable Lookup

2026-01-12

A new axis of sparsity for large language models. Builds on the findings of Geva et al., "Transformer Feed-Forward Layers Are Key-Value Memories" (2021).

mHC: Manifold-Constrained Hyper-Connections

2025-12-31

Proposes Manifold-Constrained Hyper-Connections, projecting residual connection space onto a specific manifold to restore identity mapping while maint…

DeepSeek-Math-V2 2

model dataset

2025-11-27

Specialized model family for formal and informal mathematical reasoning, with curated training datasets.

★ 1.6k ↓ 384

LPLB (Linear-Programming Load Balancer)

2025-11-19

Expert-parallel load balancer using linear programming to optimize MoE workload distribution. Supports Cube, Hypercube, Ring, and Torus topologies. Ea…

★ 505

DeepSeek-OCR / OCR-2

2025-10-20

High-efficiency models for optical character recognition and document understanding.

★ 23.3k ↓ 2.4M

DeepSeek-V3.2 2

2025-09-29

V3.2 family introducing DeepSeek Sparse Attention for long-context efficiency. V3.2-Speciale achieves gold-medal performance on IMO and IOI 2025, surp…

↓ 3.6M 📄 2

DeepSeek-V3.1

2025-08-21

V3.1 released 2025-08-21. V3.1-Terminus (max reasoning variant) released 2025-09-22.

DeepSeek-R1-0528

2025-05-28

Significant upgrade to R1 with enhanced logic and reduced hallucinations.

DeepSeek-Prover-V2

2025-04-30

RL for subgoal decomposition in formal mathematical reasoning. Includes the DeepSeek-ProverBench evaluation suite.

★ 1.3k ↓ 633

DeepSeek-GRM: Inference-Time Scaling for Generalist Reward Modeling

2025-04-03

Generalist reward model with inference-time scaling.

📄 1

3FS (Fire-Flyer File System)

2025-02-28

High-performance distributed file system designed for AI training.

★ 10k

Smallpond

2025-02-28

Lightweight distributed data processing framework built on DuckDB and 3FS. Sorted 110.5 TiB in 30 minutes at 3.66 TiB/min. Released as part of DeepSee…

★ 5k
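The two sort figures quoted for Smallpond can be checked against each other with one division. This is only a consistency check on the numbers in the description above; the variable names are illustrative and nothing here comes from the Smallpond codebase.

```python
# Figures as quoted in the entry above.
data_sorted_tib = 110.5        # total data sorted, in TiB
throughput_tib_per_min = 3.66  # reported average throughput, in TiB/min

# Time implied by the two figures together.
implied_minutes = data_sorted_tib / throughput_tib_per_min

print(f"implied sort time: {implied_minutes:.1f} minutes")
```

The result is a little over 30 minutes, which agrees with the "30 minutes" quoted once rounding is taken into account.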

DualPipe

2025-02-27

Bidirectional pipeline parallelism algorithm for overlapping computation and communication.

★ 3k

EPLB (Expert Parallelism Load Balancer)

2025-02-27

Expert-parallel load balancer for DeepSeek-V3/R1 that uses a redundant expert strategy to replicate heavy-loaded experts across GPUs for balanced infe…

★ 1.4k
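The redundant-expert idea can be illustrated with a small sketch: spare expert slots go to whichever expert currently carries the most load per replica, and the replicas are then packed greedily onto GPUs. This is not EPLB's actual algorithm or API. The function names (`replicate_experts`, `place_on_gpus`, `gpu_totals`) and the sample loads are hypothetical, and the sketch assumes load splits evenly across replicas, ignores per-GPU slot limits, and does not stop two replicas of one expert from sharing a GPU.

```python
import heapq

def replicate_experts(loads, num_slots):
    """Start with one replica per expert, then give each spare slot to the
    expert with the highest load per replica (hypothetical helper)."""
    replicas = [1] * len(loads)
    for _ in range(num_slots - len(loads)):
        hottest = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

def place_on_gpus(loads, replicas, num_gpus):
    """Greedy packing: heaviest replica first, onto the currently lightest GPU."""
    items = sorted(
        ((loads[e] / replicas[e], e)
         for e in range(len(loads))
         for _ in range(replicas[e])),
        reverse=True,
    )
    heap = [(0.0, g) for g in range(num_gpus)]  # (accumulated load, gpu id)
    placement = [[] for _ in range(num_gpus)]
    for load, expert in items:
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (total + load, gpu))
    return placement

def gpu_totals(loads, replicas, placement):
    """Per-GPU load, assuming an expert's load splits evenly over its replicas."""
    return [sum(loads[e] / replicas[e] for e in gpu) for gpu in placement]

# Made-up loads: expert 0 is five times hotter than the others.
loads = [100, 20, 20, 20]

no_replicas = [1, 1, 1, 1]
baseline = place_on_gpus(loads, no_replicas, num_gpus=4)

replicas = replicate_experts(loads, num_slots=7)
balanced = place_on_gpus(loads, replicas, num_gpus=4)

print("max GPU load without replication:", max(gpu_totals(loads, no_replicas, baseline)))
print("max GPU load with replication:   ", max(gpu_totals(loads, replicas, balanced)))
```

With these sample loads the hot expert ends up with four replicas, and the heaviest GPU carries well under half of what it does without replication.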

DeepGEMM

2025-02-26

Library for clean and efficient FP8 GEMM kernels with fine-grained scaling.

★ 7.4k

DeepEP

2025-02-25

Efficient expert-parallel communication library that bypasses NCCL for MoE communications. Used in DeepSeek models for efficient expert parallelism.

★ 9.7k

FlashMLA

2025-02-24

Highly optimized kernels for Multi-head Latent Attention.

★ 12.7k

NSA: Native Sparse Attention

2025-02-16

Hardware-aligned and natively trainable sparse attention mechanism.

📄 2

DeepSeek-R1

2025-01-22

Incentivizing reasoning capability in LLMs via reinforcement learning. R1-Lite-Preview released 2024-11-20. Full R1 paper 2025-01-20. R1-0528 update r…

↓ 5.4M

DeepSeek-V3 3

2024-12-26

Frontier 671B MoE model with Multi-Token Prediction and FP8 mixed-precision training. V3-0324 update released 2025-03-24. Accompanied by a technical r…

📄 227

DeepSeek-VL2

2024-12-13

MoE vision-language model for advanced multimodal understanding.

★ 5.3k ↓ 2.7k 📄 22

Janus 4

model paper dataset

2024-10-17

Unified autoregressive framework handling both multimodal understanding and visual generation (DALL-E style) in one model. Includes Janus and Janus-Pr…

★ 17.7k ↓ 10.9k 📄 11

DeepSeek-V2.5

2024-09-05

Merges DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724 into a single unified model.

↓ 7.7k

Auxiliary-Loss-Free Load Balancing Strategy

2024-08-28

A foundational paper for modern Mixture-of-Experts (MoE) architectures that introduces the "Loss-Free Balancing" strategy. It eliminates the tradition…

📄 6
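A minimal sketch of bias-based balancing, assuming the commonly described form of the technique: each expert has a bias that is added to its routing score only when selecting the top-k experts, and after each batch the bias moves up for underloaded experts and down for overloaded ones. The names (`route_top_k`, `expert_loads`, `update_bias`), the step size, and the synthetic scores are all illustrative assumptions, not code from the paper.

```python
import random

def route_top_k(scores, bias, k):
    """Select k experts by (score + bias); the bias only affects selection."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    return ranked[:k]

def expert_loads(batch, bias, k):
    """Count how many tokens in the batch are routed to each expert."""
    loads = [0] * len(bias)
    for scores in batch:
        for e in route_top_k(scores, bias, k):
            loads[e] += 1
    return loads

def update_bias(bias, loads, step):
    """Raise the bias of underloaded experts, lower it for overloaded ones."""
    target = sum(loads) / len(loads)
    return [b + step * ((load < target) - (load > target))
            for b, load in zip(bias, loads)]

# Synthetic router scores, deliberately skewed toward expert 0.
rng = random.Random(0)
skew = [0.3, 0.2, 0.1, 0.0]
batch = [[s + rng.uniform(0.0, 0.2) for s in skew] for _ in range(256)]

bias = [0.0] * 4
initial_loads = expert_loads(batch, bias, k=1)
for _ in range(300):
    bias = update_bias(bias, expert_loads(batch, bias, k=1), step=0.005)
final_loads = expert_loads(batch, bias, k=1)

print("loads before:", initial_loads)
print("loads after: ", final_loads)
```

No balancing term is added to the training loss in this scheme; the only thing that changes is which experts get selected.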

Fire-Flyer AI-HPC: Cost-Effective Software-Hardware Co-Design

2024-08-26

Cost-effective software-hardware co-design for deep learning infrastructure.

DeepSeek-Coder-V2 2

2024-06-17

First open-source MoE code model to beat GPT-4 Turbo on coding benchmarks. The 236B model (21B active) achieved 90.2% on HumanEval, 12.7% on SWE-Bench…

★ 6.8k ↓ 4k 📄 48

DeepSeek-Prover

2024-05-23

Specialized model for formal theorem proving in Lean 4.

★ 577 ↓ 118 📄 6

DeepSeek-V2 2

2024-05-07

Massive 236B MoE model (21B active) that introduced Multi-head Latent Attention (MLA). Accompanied by a technical report.

★ 5k ↓ 5.3k 📄 102

DeepSeek-VL

2024-03-08

Vision-language model with dynamic tiling encoder for high-resolution image understanding.

★ 4.1k ↓ 8.8k 📄 45

DeepSeek-Math 2

2024-02-05

Introduced Group Relative Policy Optimization (GRPO) for mathematical reasoning. The foundational model and paper for DeepSeek's math capabilities.

★ 3.3k 📄 69
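The "group relative" part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, and each one's reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed to form a baseline. This sketch covers only the advantage computation, not the full policy-gradient objective. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled completion of a single
    prompt; `eps` (an assumed constant) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math problem: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} -> advantage={advantage:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which every answer scores the same contributes zero advantage.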

DeepSeek-MoE 2

2024-01-11

Pioneering 16B Mixture-of-Experts model with only 2.8B active parameters, setting the stage for DeepSeek's future efficiency focus. Accompanied by a f…

↓ 18.7k 📄 16

DeepSeek-LLM 2

2023-11-29

First general-purpose 67B model, outperforming Llama 2 70B. Accompanied by the technical report "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism".

★ 7k ↓ 1.6k 📄 87

DeepSeek-Coder 2