"Language Models are Unsupervised Multitask Learners" — 1.5B parameter decoder-only Transformer (48 layers, 1600 hidden, 25 heads) trained on WebText (40GB from Reddit links with 3+ karma). 1024 token context. Released in stages due to concerns about misuse — the first major AI safety-motivated staged release.

Demonstrated that sufficiently large language models perform downstream tasks in a zero-shot setting without explicit fine-tuning, achieving SOTA on several benchmarks. GPT-2 became a foundational building block for the open-source community and remains one of the most-used models on HuggingFace. By Radford, Wu, Child, Luan, Amodei, and Sutskever. MIT License.

Model Details

Architecture DENSE
Parameters 1.5B
Context window 1,024

Variants

Name          Parameters  Notes
GPT-2 Small   124M        12 layers, 768 hidden
GPT-2 Medium  355M        24 layers, 1024 hidden
GPT-2 Large   774M        36 layers, 1280 hidden
GPT-2 XL      1.5B        48 layers, 1600 hidden
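The parameter counts of the four variants can be roughly reproduced from depth and width alone (assuming the standard GPT-2 configurations: Small 12 layers/768 hidden through XL 48 layers/1600 hidden, vocabulary 50,257, and input/output embeddings tied). A back-of-envelope sketch that ignores biases and layer norms:

```python
# Rough decoder-only Transformer parameter count:
# per layer ~12*d^2 (4*d^2 for attention projections + 8*d^2 for the MLP),
# plus token embeddings V*d (tied with the output head) and
# position embeddings 1024*d. Biases and LayerNorms are ignored,
# so totals land within about 1% of the published figures.
VOCAB, CTX = 50257, 1024

def approx_params(n_layers: int, d_model: int) -> float:
    core = 12 * n_layers * d_model ** 2
    embeddings = (VOCAB + CTX) * d_model
    return (core + embeddings) / 1e6  # in millions

# (n_layers, d_model) per variant, assumed from the standard configs.
variants = {
    "Small":  (12, 768),    # ~124M
    "Medium": (24, 1024),   # ~355M
    "Large":  (36, 1280),   # ~774M
    "XL":     (48, 1600),   # ~1.5B
}
for name, (n_layers, d_model) in variants.items():
    print(f"GPT-2 {name}: ~{approx_params(n_layers, d_model):.0f}M parameters")
```

The estimate for XL comes out slightly above 1.5B because the bias and LayerNorm terms dropped here do not exactly offset the rounding in the published count; the three smaller variants match the table to within 1M.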
Tags: open-source, open-weight, foundational
