Chinese-LiPS
Dataset: Chinese audio-visual speech recognition dataset that combines lip-reading video with presentation slides. It contains 100 hours of speech, video, and manual transcriptions across 36,208 video clips from 207 professional speakers. Audio is stereo WAV at 48 kHz; slide video is 1080p and lip-reading video is 720p. Lip-reading and slide information improve ASR performance by about 8% and 25% respectively, and by about 35% when combined. Published at an IEEE conference.
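
The audio format stated above (stereo WAV at 48 kHz) can be sanity-checked on a downloaded clip. The following is a minimal sketch using only Python's standard `wave` module; the function names `describe_wav` and `matches_stated_format` are hypothetical helpers, not part of any dataset tooling, and the sketch assumes the files are plain PCM WAV.

```python
import wave

# Format stated in the dataset description: stereo WAV at 48 kHz.
EXPECTED_CHANNELS = 2
EXPECTED_SAMPLE_RATE = 48_000


def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return channels, sample_rate, duration


def matches_stated_format(path):
    """True if the file is stereo at 48 kHz, as the description states."""
    channels, sample_rate, _ = describe_wav(path)
    return channels == EXPECTED_CHANNELS and sample_rate == EXPECTED_SAMPLE_RATE
```

Calling `matches_stated_format` on each clip path would flag any file that was resampled or downmixed during download or preprocessing. The actual directory layout and file naming of the dataset are not described here, so the path passed in is left to the reader.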