"Robust Speech Recognition via Large-Scale Weak Supervision" — an encoder-decoder Transformer trained on 680,000 hours of multilingual audio collected from the web. The largest variant (large-v3) has 1.55B parameters. Supports transcription in 99 languages and speech translation into English.

Whisper demonstrated that scaling weakly supervised pre-training on diverse audio data yields robustness approaching human level without domain-specific fine-tuning. It became one of the most widely used open-source speech recognition models, integrated into thousands of applications. Published at ICML 2023 by Radford, Kim, Xu et al. MIT License.

Model Details

Architecture: Dense
Parameters: 1.55B (large-v3)

Variants

Name              Parameters
Whisper Tiny      39M
Whisper Base      74M
Whisper Small     244M
Whisper Medium    769M
Whisper Large-v3  1.55B
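The table above can be expressed as a small lookup for choosing a checkpoint under a parameter budget. This is a minimal illustrative sketch: the `pick_variant` helper is hypothetical and not part of Whisper; only the names and sizes come from the variants table.

```python
# Parameter counts per Whisper variant, from the variants table above.
WHISPER_VARIANTS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "large-v3": 1_550_000_000,
}

def pick_variant(max_params: int) -> str:
    """Return the largest variant whose parameter count fits the budget.

    Hypothetical helper for illustration; raises ValueError when even
    the smallest variant exceeds the budget.
    """
    fitting = [(p, name) for name, p in WHISPER_VARIANTS.items() if p <= max_params]
    if not fitting:
        raise ValueError("no variant fits the given budget")
    return max(fitting)[1]

print(pick_variant(300_000_000))  # small
```

For example, a 300M-parameter budget selects `small` (244M), since `medium` (769M) would exceed it.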

Paper

arXiv: 2212.04356

Venue: ICML 2023

open-source, open-weight, speech, foundational