BERT (Bidirectional Encoder Representations from Transformers) pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers via masked language modeling. It comes in two sizes: Base (110M parameters) and Large (340M).
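The masking scheme behind masked language modeling can be sketched in plain Python. This is a minimal illustration, not BERT's actual implementation: of the ~15% of positions selected for prediction, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged; the token ids and constants below (e.g. `MASK_ID = 103` for bert-base-uncased's WordPiece vocabulary) are assumptions for the sketch.

```python
import random

MASK_ID = 103       # [MASK] id in bert-base-uncased's vocab (illustrative assumption)
VOCAB_SIZE = 30522  # bert-base-uncased vocabulary size
IGNORE = -100       # label value ignored by the training loss

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking sketch: select ~mask_prob of positions;
    of those, 80% become [MASK], 10% a random token, 10% unchanged.
    Labels hold the original id at selected positions, IGNORE elsewhere."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok          # model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID              # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```

Keeping 10% of selected tokens unchanged and corrupting 10% randomly discourages the model from relying on seeing `[MASK]` at prediction time, since that token never appears during fine-tuning.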

BERT revolutionized NLP, pushing the GLUE benchmark score to 80.5% (a 7.7-point absolute improvement) and SQuAD v1.1 to 93.2 F1. It spawned an entire generation of models (RoBERTa, ALBERT, DeBERTa, XLNet) and became the dominant approach for search, classification, and NER. Published at NAACL 2019 by Devlin, Chang, Lee, and Toutanova; over 100K citations. Released under the Apache 2.0 license.

Model Details

Architecture: Dense transformer encoder
Parameters: 110M (Base) / 340M (Large)

Paper

arXiv: 1810.04805

Venue: NAACL 2019

foundational · open-source