Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Paper Title

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Authors

Linkai Peng, Yingming Gao, Binghuai Lin, Dengfeng Ke, Yanlu Xie, Jinsong Zhang

Abstract

Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. When assessing the pronunciation quality of constrained speech, the given transcription can play the role of a teacher. Conventional methods, such as forced alignment and extended recognition networks, have made full use of the prior text for model construction or for improving system performance. Recently, some end-to-end methods have attempted to incorporate the prior text into model training and have shown preliminary effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objectives of phoneme recognition and MDD. We conducted experiments on two publicly available datasets (TIMIT and L2-Arctic), and our best model improved the F1 score from $57.51\%$ to $61.75\%$ compared to the baselines. In addition, we provide a detailed analysis to shed light on the effectiveness of the gating mechanism and contrastive learning for MDD.
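
The gating idea in the abstract can be illustrated with a small sketch. This is not the paper's architecture, which the abstract does not specify. It only assumes that one audio frame attends over the text (phoneme) embeddings with scaled dot-product attention, and that a scalar sigmoid gate then decides how much of the attended text context is added back to the audio feature. The names `attend`, `gated_fusion`, `gate_w`, and `gate_b` are hypothetical, and the gate parameters stand in for weights that would be learned in practice.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend(audio_frame, text_embs):
    # Scaled dot-product attention: the audio frame is the query,
    # the text embeddings serve as both keys and values.
    scale = math.sqrt(len(audio_frame))
    weights = softmax([dot(audio_frame, t) / scale for t in text_embs])
    dim = len(text_embs[0])
    return [sum(w * t[d] for w, t in zip(weights, text_embs)) for d in range(dim)]


def gated_fusion(audio_frame, text_embs, gate_w, gate_b):
    # gate_w (length 2 * dim) and gate_b are hypothetical learned parameters.
    context = attend(audio_frame, text_embs)
    # The gate looks at the audio feature and the text context together.
    g = sigmoid(dot(gate_w, audio_frame + context) + gate_b)
    # A gate near 0 suppresses the text context and keeps the audio feature.
    fused = [a + g * c for a, c in zip(audio_frame, context)]
    return fused, g


audio = [1.0, 0.0]
text = [[1.0, 0.0], [0.0, 1.0]]
closed, g_closed = gated_fusion(audio, text, gate_w=[0.0] * 4, gate_b=-20.0)
half, g_half = gated_fusion(audio, text, gate_w=[0.0] * 4, gate_b=0.0)
```

With a strongly negative bias the gate is nearly closed and the fused vector stays close to the audio feature, which is the behaviour one would want when the transcription does not match what was actually pronounced. With a zero bias and zero weights the gate sits at 0.5 and half of the text context is mixed in.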
