Ultra-long video understanding model using task-aware KV sparsification, developed jointly with Shanghai Jiao Tong University. Combines a SigLIP-SO400M visual encoder with a Dynamic Token Synthesis module and a segmented prefilling strategy. Surpasses all lightweight open-source models on the MLVU, VideoMME, and LVBench benchmarks, with performance approaching that of 72B-scale models. Handles thousands of frames on a single 24GB GPU and 10,000+ frames on 80GB GPUs.
video, multimodal, open-weight, efficiency
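The combination of segmented prefilling and task-aware KV sparsification can be illustrated with a minimal sketch: frame tokens are prefilled segment by segment, and after each segment the KV cache is pruned to the tokens most relevant to the task query, which bounds memory as frame counts grow. All names, the dot-product scoring rule, and the keep-ratio heuristic below are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def segmented_prefill(frame_tokens, query_vec, segment_size=4, keep_ratio=0.5):
    """Hypothetical sketch of segmented prefilling with task-aware
    KV-cache sparsification (not the model's real code)."""
    kv_cache = np.empty((0, frame_tokens.shape[1]))
    for start in range(0, len(frame_tokens), segment_size):
        segment = frame_tokens[start:start + segment_size]
        # Prefill step: append this segment's tokens to the KV cache.
        kv_cache = np.vstack([kv_cache, segment])
        # Task-aware sparsification: score cached tokens against the
        # task query and keep only the top fraction, so the cache stays
        # small no matter how many frames are processed.
        scores = kv_cache @ query_vec
        k = max(1, int(len(kv_cache) * keep_ratio))
        top = np.argsort(scores)[-k:]
        kv_cache = kv_cache[np.sort(top)]  # preserve temporal order
    return kv_cache

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 frame tokens, feature dim 8
query = rng.normal(size=8)         # task/query embedding
cache = segmented_prefill(tokens, query)
print(cache.shape)  # cache size stays bounded despite 16 input tokens
```

Because pruning happens after every segment rather than once at the end, peak memory scales with the segment size plus the retained cache, not with the full video length, which is what allows thousands of frames on a single 24GB GPU.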