Paper Title

Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition

Authors

Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, Hasim Sak

Abstract

In this paper we present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model. The model is composed of a stack of transformer layers for audio encoding with no lookahead or right context, and an additional stack of transformer layers on top trained with variable right context. At inference time, the context length of the variable-context layers can be changed to trade off the latency and accuracy of the model. We also show that we can run this model in a Y-model architecture, with the top layers running in parallel in low-latency and high-latency modes. This allows us to produce streaming speech recognition results with limited latency alongside delayed recognition results with large improvements in accuracy (20% relative improvement on a voice-search task). We show that with limited right context (1-2 seconds of audio) and a small additional latency (50-100 milliseconds) at the end of decoding, we can achieve accuracy similar to that of models using unlimited audio right context. We also present optimizations for the audio and label encoders to speed up inference in streaming and non-streaming speech decoding.
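The key mechanism the abstract describes is a transformer audio encoder whose upper layers attend over a configurable amount of right (future) context. A minimal sketch of such a mask is below; this is not the authors' code, just an illustration (function name and NumPy-based formulation are assumptions): frame i may attend to frames j ≤ i + R, so R = 0 gives a fully streaming (causal) mask, while a large R approaches the non-streaming, full-context mode.

```python
import numpy as np

def right_context_mask(num_frames: int, right_context: int) -> np.ndarray:
    """Boolean attention mask: mask[i, j] is True iff frame i may attend to frame j.

    right_context = 0 yields a causal (streaming) mask with no lookahead;
    right_context >= num_frames - 1 yields full (non-streaming) attention.
    """
    i = np.arange(num_frames)[:, None]  # query frame indices (column vector)
    j = np.arange(num_frames)[None, :]  # key frame indices (row vector)
    return j <= i + right_context

# Streaming mode: no lookahead -> lower-triangular mask.
streaming_mask = right_context_mask(4, 0)
# Relaxed mode: each frame may look 2 frames ahead.
relaxed_mask = right_context_mask(4, 2)
```

Varying the single `right_context` parameter at inference time is what lets one trained model trade latency against accuracy, rather than training separate streaming and non-streaming models.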
