Paper Title
Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Paper Authors
Paper Abstract
Layer-wise distillation is a powerful tool for compressing large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the teacher's hidden representations at every intermediate layer. However, layer-wise distillation is difficult: because the student has a smaller model capacity than the teacher, it often under-fits. Furthermore, the teacher's hidden representations contain redundant information that the student does not necessarily need for learning the target task. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student fit the target task better. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. Code is available at https://github.com/cliang1453/task-aware-distillation.
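
To make the idea concrete, the following is a minimal sketch of how task-aware, layer-wise alignment might look in PyTorch. It is not the authors' implementation (see the repository linked above for that); the filter architecture (a single linear projection), the MSE alignment loss, and the names TaskAwareFilter and layer_alignment_loss are illustrative assumptions.

# Minimal sketch of task-aware layer-wise alignment (not the official TED code).
# Assumptions: a single linear projection as the task-aware filter and an MSE
# alignment loss between filtered hidden states; both are illustrative choices.

import torch
import torch.nn as nn

class TaskAwareFilter(nn.Module):
    """Projects a layer's hidden states into a task-relevant subspace."""
    def __init__(self, hidden_size: int, filter_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, filter_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

def layer_alignment_loss(student_hiddens, teacher_hiddens,
                         student_filters, teacher_filters):
    """Sum the alignment losses between filtered student and teacher
    representations over the selected layers."""
    loss = torch.tensor(0.0)
    for h_s, h_t, f_s, f_t in zip(student_hiddens, teacher_hiddens,
                                  student_filters, teacher_filters):
        # The teacher side is detached so gradients only update the student
        # (and, in this sketch, the filters).
        loss = loss + nn.functional.mse_loss(f_s(h_s), f_t(h_t.detach()))
    return loss

In a training loop, this alignment loss would typically be added to the usual task loss (e.g., cross-entropy on the target task) when updating the student.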