Compressed BERT model using a novel Transformer distillation method. Achieves over 96.8% of BERT-base's performance on GLUE while being 7.5x smaller and 9.4x faster at inference. Introduces a two-stage learning framework that performs distillation at both the pre-training and fine-tuning stages. One of the most influential model compression works for NLP.
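
The Transformer distillation objective trains the student layer by layer against the teacher. Below is a minimal PyTorch sketch in the spirit of that method; the hidden sizes, projection layer, and soft cross-entropy formulation are illustrative assumptions, not the exact published implementation.

```python
import torch.nn as nn
import torch.nn.functional as F


class LayerDistillLoss(nn.Module):
    """Per-layer Transformer distillation loss (sketch, TinyBERT-style).

    Student/teacher hidden sizes and the learned projection are
    illustrative assumptions, not the exact published configuration.
    """

    def __init__(self, student_hidden: int = 312, teacher_hidden: int = 768):
        super().__init__()
        # Learned projection so the smaller student hidden states can be
        # compared against the larger teacher hidden states.
        self.proj = nn.Linear(student_hidden, teacher_hidden)

    def forward(self, s_attn, t_attn, s_hidden, t_hidden):
        # s_attn, t_attn: (batch, heads, seq, seq) attention matrices
        # s_hidden, t_hidden: (batch, seq, hidden) Transformer-layer outputs
        attn_loss = F.mse_loss(s_attn, t_attn)
        hidden_loss = F.mse_loss(self.proj(s_hidden), t_hidden)
        return attn_loss + hidden_loss


def prediction_distill_loss(s_logits, t_logits, temperature: float = 1.0):
    """Soft cross-entropy between teacher and student output distributions."""
    t_prob = F.softmax(t_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(s_logits / temperature, dim=-1)
    return -(t_prob * s_logprob).sum(dim=-1).mean()
```

In the paper's two-stage setup, the layer-wise losses are applied first on a general-domain corpus (pre-training stage), and then again on task data together with the prediction-level loss (fine-tuning stage).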

Outputs (2):

TinyBERT (model)

TinyBERT: Distilling BERT for Natural Language Understanding (paper)
arXiv: 1909.10351

Tags: nlp, efficiency, training, open-source