Distilling the Knowledge in a Neural Network
Introduced knowledge distillation: compressing a large "teacher" model into a smaller "student" by training on the teacher's soft probability distributions rather than hard labels. The softened outputs transfer "dark knowledge" about inter-class similarities.
Now a standard technique used across frontier labs for model compression, efficient deployment, and training smaller models (e.g., GPT-4o mini, Gemma, DeepSeek). By Hinton, Vinyals, and Dean.
Paper
arXiv: 1503.02531
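
A minimal sketch of the distillation objective described above, assuming PyTorch; the hyperparameter names (T for temperature, alpha for the soft/hard weighting) are illustrative choices, not values from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher probabilities softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # Student log-probabilities at the same temperature
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable as T changes (per the paper)
    soft_loss = F.kl_div(student_log_probs, soft_targets,
                         reduction="batchmean") * (T * T)
    # Standard cross-entropy against the hard labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination of the soft and hard objectives
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In practice the student is trained by running both models on the same batch, computing this loss from their logits, and backpropagating through the student only (the teacher is frozen).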