Paper Title

A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks

Paper Authors

Tang, Yun; Pino, Juan; Wang, Changhan; Ma, Xutai; Genzel, Dmitriy

Paper Abstract

Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to a different sequence. Its success heavily relies on the availability of large amounts of training data. This presents a challenge for speech applications where labelled speech data is very expensive to obtain, such as automatic speech recognition (ASR) and speech translation (ST). In this study, we propose a general multi-task learning framework to leverage text data for ASR and ST tasks. Two auxiliary tasks, a denoising autoencoder task and machine translation task, are proposed to be co-trained with ASR and ST tasks respectively. We demonstrate that representing text input as phoneme sequences can reduce the difference between speech and text inputs, and enhance the knowledge transfer from text corpora to the speech to text tasks. Our experiments show that the proposed method achieves a relative 10~15% word error rate reduction on the English Librispeech task compared with our baseline, and improves the speech translation quality on the MuST-C tasks by 3.6~9.2 BLEU.
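
To make the setup described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the joint-training idea: one encoder-decoder is shared between the speech task (ASR or ST) and a text auxiliary task (denoising autoencoder or MT) whose input is a phoneme sequence, so updates from the text corpus flow into the same shared parameters. All module names, dimensions, and the toy data below are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of multi-task training with a shared encoder-decoder.
# Speech batches (ASR/ST) and phoneme-text batches (denoising autoencoder/MT)
# alternate through the same model; everything here is an illustrative assumption.
import torch
import torch.nn as nn

PHONEME_VOCAB, TARGET_VOCAB, D_MODEL = 128, 10000, 256

class SharedSeq2Seq(nn.Module):
    """One encoder-decoder shared by the speech task and the text auxiliary task."""
    def __init__(self):
        super().__init__()
        # Speech features (e.g. 80-dim log-mel frames) and phoneme IDs are mapped
        # into the same model dimension so both input types share the encoder.
        self.speech_proj = nn.Linear(80, D_MODEL)
        self.phoneme_emb = nn.Embedding(PHONEME_VOCAB, D_MODEL)
        self.target_emb = nn.Embedding(TARGET_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, TARGET_VOCAB)

    def forward(self, src, tgt_in, src_is_speech):
        enc_in = self.speech_proj(src) if src_is_speech else self.phoneme_emb(src)
        # Causal mask so each target position only attends to earlier positions.
        mask = self.transformer.generate_square_subsequent_mask(tgt_in.size(1))
        dec_out = self.transformer(enc_in, self.target_emb(tgt_in), tgt_mask=mask)
        return self.out(dec_out)

model = SharedSeq2Seq()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(src, tgt, src_is_speech):
    """One update: teacher-forced decoding, cross-entropy on shifted targets."""
    logits = model(src, tgt[:, :-1], src_is_speech)
    loss = loss_fn(logits.reshape(-1, TARGET_VOCAB), tgt[:, 1:].reshape(-1))
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Toy batches standing in for (speech features, text) pairs and
# (noisy phoneme sequence, text) pairs from a text-only corpus.
speech = torch.randn(2, 50, 80)                       # ASR/ST input
phonemes = torch.randint(0, PHONEME_VOCAB, (2, 30))   # text input as phonemes
targets = torch.randint(0, TARGET_VOCAB, (2, 20))

for _ in range(3):  # alternate primary and auxiliary batches
    train_step(speech, targets, src_is_speech=True)     # ASR or ST batch
    train_step(phonemes, targets, src_is_speech=False)  # autoencoder or MT batch
```

The point mirrored here is the one the abstract makes: feeding text as phoneme sequences lets speech and text inputs pass through a common encoder-decoder, which is what allows knowledge from large text corpora to transfer to the speech-to-text tasks.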
