"Robust Speech Recognition via Large-Scale Weak Supervision" — an encoder-decoder Transformer trained on 680,000 hours of multilingual audio collected from the web. The largest variant (large-v3) has 1.55B parameters. Supports transcription in 99 languages and speech translation into English.

Whisper demonstrated that scaling weakly supervised pre-training on diverse audio data yields robustness approaching human level without domain-specific fine-tuning. It became one of the most widely used open-source speech recognition models, integrated into thousands of applications. Published at ICML 2023 by Radford, Kim, Xu et al. MIT License.

Model Details

Architecture: Dense
Parameters: 1.55B (large-v3)

Variants

Name              Parameters
Whisper Tiny      39M
Whisper Base      74M
Whisper Small     244M
Whisper Medium    769M
Whisper Large-v3  1.55B
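The table above can be expressed as a small lookup for choosing a checkpoint under a parameter budget. This is a minimal illustrative sketch: the `pick_variant` helper is hypothetical and not part of Whisper; only the names and sizes come from the variants table.

```python
# Parameter counts per Whisper variant, from the variants table above.
WHISPER_VARIANTS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "large-v3": 1_550_000_000,
}

def pick_variant(max_params: int) -> str:
    """Return the largest variant whose parameter count fits the budget.

    Hypothetical helper for illustration; raises ValueError when even
    the smallest variant exceeds the budget.
    """
    fitting = [(p, name) for name, p in WHISPER_VARIANTS.items() if p <= max_params]
    if not fitting:
        raise ValueError("no variant fits the given budget")
    return max(fitting)[1]

print(pick_variant(300_000_000))  # small
```

For example, a 300M-parameter budget selects `small` (244M), since `medium` (769M) would exceed it.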

Paper

arXiv: 2212.04356

Venue: ICML 2023

open-source, open-weight, speech, foundational