Paper Title
CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Paper Authors

Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao

Paper Abstract
We propose a CTC alignment-based single step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains the information of (a) the number of tokens for the decoder input, and (b) the acoustic time span of each token. This information is used to extract an acoustic representation for each token in parallel, referred to as a token-level acoustic embedding, which substitutes for the word embedding in an autoregressive transformer (AT) to achieve parallel generation in the decoder. During inference, an error-based alignment sampling method is proposed for the CTC output space, reducing the WER while retaining parallelism. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on the Librispeech test-clean/test-other sets without an external LM, and a CER of 5.8% on the Aishell1 Mandarin corpus. Compared to the AT baseline, CASS-NAT shows a small degradation in WER but is 51.2x faster in terms of RTF. When decoding with an oracle CTC alignment, the lower bound of the WER without an LM reaches 2.3% on the test-clean set, indicating the potential of the proposed method.
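The core idea (a frame-level CTC alignment determines both the number of decoder tokens and each token's acoustic time span, from which token-level acoustic embeddings are pooled) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; function names, mean-pooling as the extraction operator, and the toy alignment are illustrative assumptions.

```python
import numpy as np

def token_segments_from_ctc_alignment(alignment, blank=0):
    """Collapse a frame-level CTC alignment into token segments.

    alignment: per-frame label ids (e.g. a CTC argmax path), blank=0.
    Repeated non-blank labels are merged and blanks are skipped, so the
    result encodes (a) the number of tokens and (b) each token's
    (start, end) frame span -- the two pieces of information the
    abstract attributes to the CTC alignment.
    """
    segments = []
    prev = blank
    for t, lab in enumerate(alignment):
        if lab != blank and lab != prev:
            segments.append([lab, t, t + 1])   # a new token starts here
        elif lab != blank and lab == prev:
            segments[-1][2] = t + 1            # extend the current token's span
        prev = lab
    return [(lab, (s, e)) for lab, s, e in segments]

def token_level_acoustic_embeddings(encoder_out, segments):
    """Mean-pool encoder frames inside each token's time span (a stand-in
    for the paper's token-level acoustic embedding extraction).

    encoder_out: (T, D) array of frame-level acoustic features.
    Returns (num_tokens, D) embeddings that could replace word
    embeddings as parallel decoder input.
    """
    return np.stack([encoder_out[s:e].mean(axis=0) for _, (s, e) in segments])

# Toy alignment over 8 frames: blank=0, token "a"=1 (frames 2-3), "b"=2 (frame 6)
alignment = [0, 0, 1, 1, 0, 0, 2, 0]
segs = token_segments_from_ctc_alignment(alignment)
enc = np.arange(8 * 4, dtype=float).reshape(8, 4)  # fake (T=8, D=4) encoder output
emb = token_level_acoustic_embeddings(enc, segs)   # one embedding per token
```

Because every token's span is known up front, the pooling loop has no dependence on previously generated tokens, which is what makes single-step parallel decoding possible.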