Paper Title
Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR
Paper Authors
Paper Abstract
Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. It also allows our production teacher to adapt to new data domains continuously.
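To make the soft/hard distinction concrete, the minimal PyTorch sketch below contrasts the two objectives: a soft-target loss that matches the student to the teacher's full output distribution via KL divergence, and a hard-target loss that trains on the teacher's decoded 1-best hypothesis as a pseudo-label. The per-frame softmax view, the function names, and the use of cross-entropy as a stand-in for the RNN-T transducer loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-target distillation: match the student's output distribution to the
    teacher's posterior via KL divergence.

    Both logit tensors are assumed to have shape (batch, frames, vocab); this is
    a simplified per-frame view, not the full RNN-T lattice used in practice.
    """
    t = temperature
    vocab = student_logits.size(-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    # KL(teacher || student), averaged over all (batch, frame) positions.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def hard_target_loss(student_logits, teacher_1best_ids):
    """Hard-target distillation: treat the teacher's decoded 1-best hypothesis as
    a pseudo-label and train on it with an ordinary supervised objective.

    Cross-entropy stands in here for the RNN-T transducer loss that a real
    system would compute against the pseudo-transcript.
    """
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.reshape(-1, vocab),
                           teacher_1best_ids.reshape(-1))

if __name__ == "__main__":
    batch, frames, vocab = 2, 50, 128
    student = torch.randn(batch, frames, vocab)
    teacher = torch.randn(batch, frames, vocab)
    pseudo_labels = teacher.argmax(dim=-1)  # stand-in for a beam-search 1-best
    print("soft:", soft_target_loss(student, teacher, temperature=2.0).item())
    print("hard:", hard_target_loss(student, pseudo_labels).item())
```

In this toy setup, the hard-target path only sees the teacher's single best hypothesis, while the soft-target path also receives the teacher's uncertainty over the whole vocabulary, which is why the two behave differently depending on how similar the teacher and student architectures are.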