Paper Title


Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

Authors

Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik Audhkhasi, Brian Kingsbury, Ron Hoory, Michael Picheny

Abstract


Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in which the text-to-intent data is converted into speech-to-intent data using a multi-speaker text-to-speech system. The proposed approaches recover 80% of performance lost due to using limited intent-labeled speech.
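The first technique above ties acoustic embeddings to fine-tuned BERT text embeddings. A minimal sketch of such a tied-embedding objective is shown below: a cross-entropy loss on the intent classifier output plus an MSE term pulling the acoustic embedding toward the corresponding text embedding. The function name, the weighting parameter `lam`, and the use of plain NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np

def tied_embedding_loss(acoustic_emb, text_emb, logits, intent_id, lam=1.0):
    """Sketch of a transfer-learning loss that ties acoustic and text embeddings.

    acoustic_emb: embedding produced by the speech encoder
    text_emb:     target embedding from a fine-tuned text model (e.g. BERT)
    logits:       unnormalized intent-classifier scores for this utterance
    intent_id:    index of the correct intent label
    lam:          assumed weight balancing the two terms (not from the paper)
    """
    # Softmax cross-entropy over intent classes (numerically stabilized).
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    ce = -log_probs[intent_id]
    # Embedding-tying term: mean squared error between the two embeddings.
    mse = np.mean((acoustic_emb - text_emb) ** 2)
    return ce + lam * mse
```

When the acoustic embedding matches the text embedding exactly, the MSE term vanishes and only the classification loss remains; during training, minimizing the combined loss pushes the speech encoder's output space toward the text model's.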
