Paper Title
Unsupervised Audio-Visual Lecture Segmentation
Paper Authors
Paper Abstract
Over the last decade, online lecture videos have become increasingly popular and experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing number of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation, which splits lectures into bite-sized topics that show promise in improving learner engagement. We formulate lecture segmentation as an unsupervised task that leverages visual, textual, and OCR cues from the lecture, while clip representations are fine-tuned on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
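The abstract describes generating segments by running TW-FINCH, a temporally weighted 1-nearest-neighbor clustering algorithm, over the fine-tuned clip representations. Below is a minimal illustrative sketch of one such clustering pass, assuming per-clip embeddings given in temporal order; the function name `tw_finch_step` and its interface are hypothetical, and this is not the authors' released implementation.

```python
# Illustrative sketch of one TW-FINCH-style clustering pass over clip features.
# Hypothetical helper; not the authors' released code.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def tw_finch_step(features: np.ndarray) -> np.ndarray:
    """One pass: temporally weighted 1-NN graph -> connected components.

    features: (N, D) array of per-clip embeddings, in temporal order.
    Returns an (N,) array of cluster labels; contiguous runs of the same
    label form candidate lecture segments.
    """
    n = len(features)
    # Cosine distance between L2-normalized clip embeddings.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    spatial = 1.0 - f @ f.T
    # Temporal weighting: clips far apart in time are penalized.
    t = np.arange(n, dtype=np.float64) / max(n - 1, 1)
    temporal = np.abs(t[:, None] - t[None, :])
    dist = spatial * temporal
    np.fill_diagonal(dist, np.inf)   # exclude self-matches
    nn = dist.argmin(axis=1)         # first nearest neighbor of each clip
    # Link every clip to its 1-NN; connected components are the clusters.
    rows = np.arange(n)
    adj = csr_matrix((np.ones(n), (rows, nn)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels
```

In the full algorithm this pass would be applied recursively on cluster means until the desired granularity is reached, and contiguous runs of the same label are read off as the final lecture segments.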