Paper Title
Two-Pass Low Latency End-to-End Spoken Language Understanding
Paper Authors
Paper Abstract
End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches. However, recent work has shown that these models struggle to generalize to new phrasings of the same intent, indicating that they do not understand the semantic content of the given utterance. In this work, we incorporate language models pre-trained on unlabeled text data into an E2E SLU framework to build strong semantic representations. Combining semantic and acoustic information can increase inference time, leading to high latency when deployed in applications like voice assistants. We develop a 2-pass SLU system that makes a low-latency prediction in the first pass using acoustic information from the first few seconds of audio, and a higher-quality prediction in the second pass by combining semantic and acoustic representations. We take inspiration from prior work on 2-pass end-to-end speech recognition systems that attend to both the audio and the first-pass hypothesis using a deliberation network. The proposed 2-pass SLU system outperforms acoustic-based SLU models on the Fluent Speech Commands Challenge Set and the SLURP dataset while reducing latency, thus improving the user experience. Our code and models are publicly available as part of the ESPnet-SLU toolkit.
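To make the 2-pass idea concrete, the sketch below shows one way such a system could be wired up: a shared acoustic encoder, a first-pass intent classifier over a short audio prefix for low latency, and a second-pass deliberation step that attends over the full acoustic encoding conditioned on an embedding of the first-pass hypothesis. This is a minimal illustrative sketch, not the authors' ESPnet-SLU implementation; all module choices, names, and sizes (GRU encoder, `hyp_embed`, `prefix_frames`, etc.) are assumptions standing in for the paper's conformer encoder, pre-trained language model, and deliberation network.

```python
# Minimal sketch (assumed, not the paper's implementation) of a 2-pass SLU model:
# pass 1 predicts the intent from a short acoustic prefix; pass 2 "deliberates"
# by attending over the full acoustic encoding, queried by an embedding of the
# first-pass hypothesis, and re-predicts the intent.
import torch
import torch.nn as nn


class TwoPassSLU(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_intents=31):
        super().__init__()
        # Shared acoustic encoder (stand-in for the paper's encoder).
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # First pass: intent classifier over a pooled acoustic prefix.
        self.first_pass = nn.Linear(hidden, num_intents)
        # Embedding of the first-pass hypothesis (stand-in for a pre-trained
        # language model applied to the hypothesis).
        self.hyp_embed = nn.Embedding(num_intents, hidden)
        # Second pass: deliberation attention over acoustic frames,
        # queried by the hypothesis embedding.
        self.deliberation = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.second_pass = nn.Linear(2 * hidden, num_intents)

    def forward(self, feats, prefix_frames=100):
        # feats: (batch, time, feat_dim) acoustic features, e.g. log-mel.
        enc, _ = self.encoder(feats)                      # (B, T, H)
        # ---- Pass 1: low-latency prediction from the audio prefix ----
        prefix = enc[:, :prefix_frames].mean(dim=1)       # pool first ~1 s
        logits1 = self.first_pass(prefix)                 # (B, num_intents)
        # ---- Pass 2: deliberate over full audio + first-pass hypothesis ----
        hyp = self.hyp_embed(logits1.argmax(dim=-1)).unsqueeze(1)   # (B, 1, H)
        attended, _ = self.deliberation(hyp, enc, enc)    # attend to acoustics
        logits2 = self.second_pass(
            torch.cat([attended.squeeze(1), prefix], dim=-1)
        )
        return logits1, logits2


# Usage: emit the fast first-pass intent immediately, then refine it.
model = TwoPassSLU()
logits1, logits2 = model(torch.randn(2, 300, 80))
print(logits1.argmax(-1), logits2.argmax(-1))
```

In a deployed voice assistant, the first-pass output could be surfaced as soon as the audio prefix is available, with the second-pass prediction replacing it once the full utterance has been processed, which is how the latency reduction described in the abstract would be realized.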