Video-SafetyBench
The first comprehensive safety benchmark for video LVLMs. It contains 2,264 video-text pairs covering 13 unsafe categories and 48 fine-grained subcategories; each pair combines a synthesized video of roughly 10 seconds with either a harmful or a benign query. It also introduces RJScore (RiskJudgeScore), an LLM-based metric that uses token-level logit distributions to capture the judge's confidence and to align with human safety judgments.
Joint work between BAAI FlagEval and Beijing University of Posts and Telecommunications. Accepted to NeurIPS 2025 Datasets & Benchmarks track.
Evaluation Details
Questions: 2,264
Domains: 2 (video safety, video-text multimodal)
Scoring: RJScore (LLM-judge logit distributions)
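
The summary says RJScore uses the judge's token-level logit distribution instead of a single sampled verdict. A minimal sketch of one way such a score could be computed follows. It is an illustration under stated assumptions, not the benchmark's implementation: the integer rating scale, the function name `expected_judge_score`, and the use of a probability-weighted mean are all hypothetical choices made here.

```python
import math


def expected_judge_score(score_logits):
    """Return the probability-weighted mean rating from a judge's logits.

    score_logits maps each candidate integer rating (assumed scale, e.g. 1-5)
    to the raw logit the judge assigned to that rating's token.
    """
    if not score_logits:
        raise ValueError("score_logits must not be empty")

    # Softmax over the candidate rating tokens only, shifted by the
    # maximum logit for numerical stability.
    max_logit = max(score_logits.values())
    weights = {s: math.exp(l - max_logit) for s, l in score_logits.items()}
    total = sum(weights.values())

    # Expected rating: sum of rating * probability.
    return sum(s * w / total for s, w in weights.items())


# Hypothetical logits: the judge leans towards a high-risk rating of 4 or 5.
example = {1: -2.0, 2: -1.0, 3: 0.0, 4: 2.0, 5: 2.5}
score = expected_judge_score(example)
```

Compared with taking only the most likely rating, a weighted mean of this kind keeps the judge's uncertainty: two responses that both receive a top rating of 5 can still be separated by how much probability mass sits on the lower ratings.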