Dynamic BERT model with adaptive width and depth, so that model size and latency can be adjusted at runtime. Trained with knowledge distillation from the full-sized model to smaller sub-networks, with network rewiring so that the more important attention heads are shared across sub-networks. Published at NeurIPS 2020.
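
As a rough illustration of the width side of this idea, the sketch below ranks attention heads by an importance score, reorders ("rewires") them so the most important come first, and then keeps a leading fraction for a given width multiplier. The function names, the importance values, and the rounding rule are hypothetical and chosen for illustration; they are not taken from the DynaBERT code. Depth adaptation is not shown.

```python
from typing import List


def rewire_heads(importance: List[float]) -> List[int]:
    """Return head indices ordered from most to least important.

    `importance` is a hypothetical per-head score; how it is computed
    is outside the scope of this sketch.
    """
    return sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)


def select_heads(order: List[int], width_mult: float) -> List[int]:
    """Keep the leading fraction of rewired heads (at least one head).

    The rounding rule here is an assumption made for this example.
    """
    keep = max(1, round(len(order) * width_mult))
    return order[:keep]


# Illustrative scores for a 4-head layer (made-up numbers).
scores = [0.1, 0.9, 0.4, 0.7]
order = rewire_heads(scores)

full = select_heads(order, 1.0)   # all four heads
half = select_heads(order, 0.5)   # the two most important heads

print(full)  # [1, 3, 2, 0]
print(half)  # [1, 3]
```

Because the narrower selection is always a prefix of the wider one, the highest-scoring heads end up in every sub-network, which is the sharing property the description refers to.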

Outputs 2

DynaBERT

model

DynaBERT: Dynamic BERT with Adaptive Width and Depth

paper

arXiv: 2004.04037

nlp, efficiency, training, open-source