Paper Title
Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models
Paper Authors
Paper Abstract
In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Audio and audio-visual systems have exhibited impressive progress in the domain of speech recognition. Nevertheless, much remains to be explored with regard to visual speech recognition systems due to the visual ambiguity of some phonemes. To this end, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are: i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems; ii) leveraging audio data during the training of visual models, which has not been done in prior word-based work; iii) proposing Gaussian-shaped averaging in frame-level KD as an efficient technique that helps the model distill knowledge at the sequence-model encoder. This work proposes a novel and competitive architecture for lipreading, as we demonstrate a noticeable improvement in performance, setting a new benchmark of 88.64% on the LRW dataset.
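To make the frame-level KD idea concrete, the sketch below shows one plausible reading of Gaussian-shaped averaging: each visual-encoder frame is trained to match a Gaussian-weighted average of the audio-encoder frames around the corresponding time position, which handles the differing frame rates of the two modalities. This is a minimal illustration, not the paper's implementation; the function names, the MSE distillation loss, and the linear time-axis alignment are all assumptions.

```python
import torch
import torch.nn.functional as F


def gaussian_weights(center: float, length: int, sigma: float) -> torch.Tensor:
    """Normalized Gaussian-shaped weights over `length` teacher time steps,
    centered at (fractional) position `center`."""
    t = torch.arange(length, dtype=torch.float32)
    w = torch.exp(-0.5 * ((t - center) / sigma) ** 2)
    return w / w.sum()


def frame_level_kd_loss(student_feats: torch.Tensor,
                        teacher_feats: torch.Tensor,
                        sigma: float = 2.0) -> torch.Tensor:
    """Hypothetical frame-level KD loss with Gaussian-shaped averaging.

    student_feats: (T_s, D) visual (lipreading) encoder outputs.
    teacher_feats: (T_t, D) audio encoder outputs, possibly at a
        different frame rate.

    Each student frame is pulled toward a soft target: the Gaussian-weighted
    average of teacher frames around the linearly-scaled time position.
    """
    T_s, _ = student_feats.shape
    T_t, _ = teacher_feats.shape
    scale = T_t / T_s  # assumed linear alignment of the two time axes
    losses = []
    for i in range(T_s):
        w = gaussian_weights(i * scale, T_t, sigma)        # (T_t,)
        target = (w.unsqueeze(1) * teacher_feats).sum(0)   # soft-aligned teacher frame
        losses.append(F.mse_loss(student_feats[i], target))
    return torch.stack(losses).mean()
```

In this formulation, a larger `sigma` lets each visual frame distill from a wider temporal context of the audio encoder, while `sigma -> 0` degenerates to hard nearest-frame matching.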