Comprehensive benchmark for long-context understanding in LLMs.

Evaluation Details

Questions: 4,750
Tasks: 21
Domains: 6 (single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, code completion)
Scoring: Automatic task-specific metrics: F1 (most QA tasks and TriviaQA few-shot), ROUGE-L (summarization tasks plus DuReader and SAMSum), classification accuracy (TREC/LSHT and synthetic retrieval/counting tasks), edit similarity (code completion)
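
To make the four metric families named under Scoring concrete, here is a minimal Python sketch of each one. It is an illustration under stated assumptions, not the benchmark's official scoring code: tokenization is plain lowercase whitespace splitting with no answer normalization, ROUGE-L is shown in its balanced F-measure form over the longest common subsequence, and edit similarity is taken as one minus the normalized Levenshtein distance. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 (assumes lowercase whitespace tokenization)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as a balanced F-measure over token-level LCS (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their label."""
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """1 - normalized Levenshtein distance; 1.0 means identical strings."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

Each function returns a score in the range 0 to 1 for a single example (or, for `accuracy`, a batch), so a per-task score would be the mean over that task's examples.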

Dataset

benchmark, scaling