Distilling the Knowledge in a Neural Network
Introduced knowledge distillation: compressing a large "teacher" model into a smaller "student" by training on the teacher's soft probability distributions rather than hard labels. The softened outputs transfer "dark knowledge" about inter-class similarities.
Now a standard technique used across frontier labs for model compression, efficient deployment, and training smaller models (e.g., GPT-4o mini, Gemma, DeepSeek). By Hinton, Vinyals, and Dean.
Paper
arXiv: 1503.02531
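
A minimal sketch of the distillation objective described above, assuming PyTorch; the hyperparameter names (T for temperature, alpha for the soft/hard weighting) are illustrative choices, not values from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher probabilities softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # Student log-probabilities at the same temperature
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable as T changes (per the paper)
    soft_loss = F.kl_div(student_log_probs, soft_targets,
                         reduction="batchmean") * (T * T)
    # Standard cross-entropy against the hard labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination of the soft and hard objectives
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In practice the student is trained by running both models on the same batch, computing this loss from their logits, and backpropagating through the student only (the teacher is frozen).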