"Fast Inference from Transformers via Speculative Decoding." Uses a small draft model to propose tokens verified in parallel by the large model, achieving 2–3x speedup with mathematically identical outputs.

Now a standard inference optimization used by virtually every LLM serving system (vLLM, TGI, TensorRT-LLM). Published at ICML 2023 by Leviathan, Kalman, and Matias.
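The core of the method is its accept/reject rule: each drafted token is accepted with probability min(1, p/q), where q is the draft model's probability and p is the target model's; on rejection, a replacement is sampled from the normalized residual max(0, p − q). A minimal sketch of that rule, using toy per-position distributions in place of real models (the `speculative_step` helper is illustrative, not the paper's code, and the bonus token sampled after a fully accepted draft is omitted for brevity):

```python
import random

def speculative_step(p, q, drafted, rng):
    """Verify drafted tokens left to right against the target distribution p.

    p, q: lists of dicts mapping token -> probability, one per position
          (p from the target model, q from the draft model).
    Returns the accepted prefix, plus one corrective token on rejection,
    so the combined output follows p exactly.
    """
    accepted = []
    for i, tok in enumerate(drafted):
        # Accept drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q), normalized.
            residual = {t: max(0.0, p[i][t] - q[i][t]) for t in p[i]}
            z = sum(residual.values())
            r = rng.random() * z
            for t, w in residual.items():
                r -= w
                if r <= 0:
                    accepted.append(t)
                    break
            break  # stop verifying past the first rejection
    return accepted
```

When draft and target agree exactly (p == q), every drafted token is accepted, which is why a well-matched draft model yields long accepted runs and most of the speedup.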

Paper

arXiv: 2211.17192

Venue: ICML 2023

foundational · efficiency