RULER: What's the Real Context Size of Your LLM?
Long-context evaluation framework with 13 tasks across 4 categories: retrieval (single- and multi-key needle-in-a-haystack variants), multi-hop tracing (variable tracking), aggregation (common/frequent-word extraction), and question answering. Tests genuine context utilization at lengths from 4K to 1M+ tokens, going well beyond the simplistic needle-in-a-haystack test; a sketch of one such retrieval task follows below.
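To make the retrieval category concrete, here is a minimal sketch of a single-key needle-in-a-haystack example generator. It is illustrative only, not the benchmark's actual implementation: the filler text, key/value format, function names, and word-to-token ratio are all assumptions.

```python
import random
import uuid

# Filler sentences commonly used in needle-in-a-haystack setups (illustrative choice).
NOISE = ("The grass is green. The sky is blue. The sun is yellow. "
         "Here we go. There and back again. ")

def make_single_needle_task(context_tokens: int = 4096,
                            tokens_per_word: float = 1.3) -> dict:
    """Build one single-key retrieval example: hide a key-value pair in
    filler text and ask the model to return the value."""
    key = f"magic-{uuid.uuid4().hex[:8]}"
    value = str(random.randint(100000, 999999))
    needle = f"The special value for {key} is {value}."

    # Rough word budget for the target context length (a crude assumption).
    n_words = int(context_tokens / tokens_per_word)
    noise_words = NOISE.split()
    filler = (noise_words * (n_words // len(noise_words) + 1))[:n_words]

    # Insert the needle at a random depth in the haystack.
    depth = random.randint(0, len(filler))
    haystack = " ".join(filler[:depth] + [needle] + filler[depth:])

    prompt = (f"{haystack}\n\nQuestion: What is the special value for {key}? "
              f"Answer with the number only.")
    return {"prompt": prompt, "answer": value}

def score(model_output: str, answer: str) -> bool:
    # Exact-match check: did the model reproduce the hidden value?
    return answer in model_output

task = make_single_needle_task(context_tokens=8192)
print(task["prompt"][:200], "...")
print("expected answer:", task["answer"])
```

Scoring in this sketch is a plain substring match on the hidden value; sweeping `context_tokens` upward is what exposes the degradation described next.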
Revealed that many models claiming 128K+ context windows degrade significantly on real tasks at those lengths. Became the de facto standard for long-context evaluation — used by model developers (Anthropic, Meta, NVIDIA) to report context-window capabilities. RULER@128K and RULER@1M scores appear on major model cards. By Hsieh et al. (NVIDIA Research).