"Fast Inference from Transformers via Speculative Decoding." Uses a small draft model to propose tokens verified in parallel by the large model, achieving 2–3x speedup with mathematically identical outputs.

Now a standard inference optimization used by virtually every LLM serving system (vLLM, TGI, TensorRT-LLM). Published at ICML 2023 by Leviathan, Kalman, and Matias.
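The core of the method is its accept/reject rule: each drafted token is accepted with probability min(1, p/q), where q is the draft model's probability and p is the target model's; on rejection, a replacement is sampled from the normalized residual max(0, p − q). A minimal sketch of that rule, using toy per-position distributions in place of real models (the `speculative_step` helper is illustrative, not the paper's code, and the bonus token sampled after a fully accepted draft is omitted for brevity):

```python
import random

def speculative_step(p, q, drafted, rng):
    """Verify drafted tokens left to right against the target distribution p.

    p, q: lists of dicts mapping token -> probability, one per position
          (p from the target model, q from the draft model).
    Returns the accepted prefix, plus one corrective token on rejection,
    so the combined output follows p exactly.
    """
    accepted = []
    for i, tok in enumerate(drafted):
        # Accept drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q), normalized.
            residual = {t: max(0.0, p[i][t] - q[i][t]) for t in p[i]}
            z = sum(residual.values())
            r = rng.random() * z
            for t, w in residual.items():
                r -= w
                if r <= 0:
                    accepted.append(t)
                    break
            break  # stop verifying past the first rejection
    return accepted
```

When draft and target agree exactly (p == q), every drafted token is accepted, which is why a well-matched draft model yields long accepted runs and most of the speedup.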

Paper

arXiv: 2211.17192

Venue: ICML 2023

foundational · efficiency