Paper Title
Effective Decoder Masking for Transformer-Based End-to-End Speech Recognition
Paper Authors
Paper Abstract
The attention-based encoder-decoder modeling paradigm has achieved promising results on a variety of speech processing tasks, such as automatic speech recognition (ASR) and text-to-speech (TTS), among others. This paradigm takes advantage of the generalization ability of neural networks to learn a direct mapping from an input sequence to an output sequence, without recourse to prior knowledge such as audio-text alignments or pronunciation lexicons. However, ASR models stemming from this paradigm are prone to overfitting, especially when the training data is limited. Inspired by SpecAugment and BERT-like masked language modeling, we propose in this paper a decoder masking based training approach for end-to-end (E2E) ASR models. During the training phase, we randomly replace some portions of the decoder's historical text input with the symbol [mask], in order to encourage the decoder to robustly output the correct token even when parts of its decoding history are masked or corrupted. The proposed approach is instantiated with a top-of-the-line Transformer-based E2E ASR model. Extensive experiments on the LibriSpeech 960h and TED-LIUM 2 benchmark datasets demonstrate the superior performance of our approach in comparison to some existing strong E2E ASR systems.
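To make the masking step concrete, below is a minimal PyTorch sketch of how the decoder's teacher-forcing input could be corrupted during training. The function name mask_decoder_history, the masking probability (0.15, borrowed from BERT), and the special-token IDs are all illustrative assumptions; the abstract does not specify the masking rate or which positions are exempt, so this is a plausible reading rather than the authors' exact implementation.

import torch

def mask_decoder_history(tokens: torch.Tensor,
                         mask_id: int,
                         mask_prob: float = 0.15,
                         pad_id: int = 0,
                         sos_id: int = 1) -> torch.Tensor:
    # tokens: (batch, seq_len) ground-truth token IDs fed to the
    # decoder under teacher forcing.
    # Draw an independent Bernoulli(mask_prob) decision per position;
    # mask_prob=0.15 follows BERT and is an assumption here.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    # Never mask padding or the start-of-sequence symbol.
    mask &= (tokens != pad_id) & (tokens != sos_id)
    # Replace the selected positions with the [mask] token ID.
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens)

# Hypothetical usage inside a training step:
# decoder_in = mask_decoder_history(decoder_in, mask_id=vocab["[mask]"])

Because the corruption is applied only to the decoder's input history (not to the prediction targets), the cross-entropy loss still supervises the original tokens, which is what pushes the decoder to recover the correct output from a partially masked context.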