Ultra-long video understanding model using task-aware KV sparsification, developed jointly with Shanghai Jiao Tong University. Combines a SigLIP-SO400M visual encoder with a Dynamic Token Synthesis module and a segmented prefilling strategy. Surpasses all lightweight open-source models on the MLVU, VideoMME, and LVBench benchmarks, with performance approaching that of 72B-scale models. Handles thousands of frames on a single 24GB GPU and 10,000+ frames on 80GB GPUs.
video, multimodal, open-weight, efficiency
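The combination of segmented prefilling and task-aware KV sparsification can be illustrated with a minimal sketch: frame tokens are prefilled segment by segment, and after each segment the KV cache is pruned to the tokens most relevant to the task query, which bounds memory as frame counts grow. All names, the dot-product scoring rule, and the keep-ratio heuristic below are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def segmented_prefill(frame_tokens, query_vec, segment_size=4, keep_ratio=0.5):
    """Hypothetical sketch of segmented prefilling with task-aware
    KV-cache sparsification (not the model's real code)."""
    kv_cache = np.empty((0, frame_tokens.shape[1]))
    for start in range(0, len(frame_tokens), segment_size):
        segment = frame_tokens[start:start + segment_size]
        # Prefill step: append this segment's tokens to the KV cache.
        kv_cache = np.vstack([kv_cache, segment])
        # Task-aware sparsification: score cached tokens against the
        # task query and keep only the top fraction, so the cache stays
        # small no matter how many frames are processed.
        scores = kv_cache @ query_vec
        k = max(1, int(len(kv_cache) * keep_ratio))
        top = np.argsort(scores)[-k:]
        kv_cache = kv_cache[np.sort(top)]  # preserve temporal order
    return kv_cache

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 frame tokens, feature dim 8
query = rng.normal(size=8)         # task/query embedding
cache = segmented_prefill(tokens, query)
print(cache.shape)  # cache size stays bounded despite 16 input tokens
```

Because pruning happens after every segment rather than once at the end, peak memory scales with the segment size plus the retained cache, not with the full video length, which is what allows thousands of frames on a single 24GB GPU.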