"Language Models are Few-Shot Learners" — 175B parameter dense Transformer (96 layers, 12288 hidden, 96 heads) trained on 300B tokens from a filtered CommonCrawl mix. 2048 token context. Demonstrated that scaling alone enables in-context few-shot learning without gradient updates.

GPT-3 was a landmark: it showed that a single model could perform translation, question answering, and code generation via prompting alone. Published at NeurIPS 2020, it is one of the most-cited AI papers ever (over 30,000 citations). It powered the initial OpenAI API, and its fine-tuned descendants (the GPT-3.5 series) powered the initial ChatGPT. By Brown, Mann, Ryder, Subbiah et al. Weights are proprietary.
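The few-shot mechanism is purely in-context: task demonstrations are concatenated into the prompt and the model completes the pattern, with no gradient updates. A minimal sketch using the English-to-French format from the paper (the template string itself is illustrative):

```python
# Few-shot prompting: demonstrations go in the context window;
# the model is asked to continue the pattern. No weights change.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in examples)
prompt += f"{query} =>"

print(prompt)
# The completion API would then be asked to generate the next tokens,
# ideally "menthe poivrée".
```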

Model Details

Architecture: Dense
Parameters: 175B
Context window: 2,048 tokens

Paper

arXiv: 2005.14165

Venue: NeurIPS 2020

Tags: frontier, foundational
