Whisper
Introduced in "Robust Speech Recognition via Large-Scale Weak Supervision", Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual audio data collected from the web. The largest variant (large-v3) has 1.55B parameters. It supports transcription and translation across 99 languages.
Whisper demonstrated that scaling weakly supervised pre-training on diverse audio data approaches human-level robustness without domain-specific fine-tuning. It became one of the most widely used open-source speech recognition models, integrated into thousands of applications. Published at ICML 2023 by Radford, Kim, Xu, et al. MIT License.
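Transcription and translation can be sketched with the open-source `openai-whisper` package — an assumption, since the card does not prescribe a particular library; the audio path and model name below are placeholders.

```python
def transcribe(audio_path: str, model_name: str = "small", translate: bool = False) -> str:
    """Transcribe (or translate into English) an audio file with Whisper.

    Requires the `openai-whisper` package: pip install -U openai-whisper
    """
    import whisper  # imported lazily so this module loads without the package installed

    model = whisper.load_model(model_name)  # e.g. "tiny", "base", "small", "medium", "large-v3"
    # task="translate" maps speech in any of the 99 supported languages to English text;
    # the default task ("transcribe") keeps the source language.
    result = model.transcribe(audio_path, task="translate" if translate else "transcribe")
    return result["text"]
```

For example, `transcribe("meeting.mp3", model_name="large-v3", translate=True)` would return an English transcript regardless of the source language.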
Model Details
Architecture DENSE
Parameters 1.55B
Variants
| Name | Parameters | Notes |
|---|---|---|
| Whisper Tiny | 39M | — |
| Whisper Base | 74M | — |
| Whisper Small | 244M | — |
| Whisper Medium | 769M | — |
| Whisper Large-v3 | 1.55B | — |
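The variants table above can be encoded as a simple lookup to pick the largest model that fits a parameter budget — a hypothetical helper for illustration, not part of any Whisper API:

```python
from typing import Optional

# Parameter counts from the variants table, in millions.
WHISPER_VARIANTS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v3": 1550,
}

def largest_within_budget(max_params_m: float) -> Optional[str]:
    """Return the largest Whisper variant whose parameter count (in millions)
    does not exceed max_params_m, or None if none fits."""
    fitting = [(params, name) for name, params in WHISPER_VARIANTS.items()
               if params <= max_params_m]
    return max(fitting)[1] if fitting else None
```

For instance, a budget of 800M parameters selects `"medium"` (769M), since `"large-v3"` (1.55B) exceeds it.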
Paper
arXiv: 2212.04356
Venue: ICML 2023