Paper Title
Staged Training for Transformer Language Models
Paper Authors
Paper Abstract
The current standard approach to scaling transformer language models trains each model size from a different random initialization. As an alternative, we consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training by applying a "growth operator" to increase the model depth and width. By initializing each stage with the output of the previous one, the training process effectively reuses the compute from prior stages and becomes more efficient. Our growth operators each take as input the entire training state (including model parameters, optimizer state, learning rate schedule, etc.) and output a new training state from which training continues. We identify two important properties of these growth operators, namely that they preserve both the loss and the "training dynamics" after applying the operator. While the loss-preserving property has been discussed previously, to the best of our knowledge this work is the first to identify the importance of preserving the training dynamics (the rate of decrease of the loss during training). To find the optimal schedule for stages, we use the scaling laws of Kaplan et al. (2020) to derive a precise schedule that yields the greatest compute savings by starting a new stage whenever training efficiency begins to decrease. We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings compared to a strong baseline trained from scratch. Our code is available at https://github.com/allenai/staged-training.
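To make the growth-operator interface concrete, the following is a minimal, hypothetical PyTorch sketch (not the paper's implementation, which is in the linked repository). The toy model `TinyTransformerLM`, the function `grow_depth`, and all hyperparameters here are illustrative assumptions; the sketch only shows the shape of an operator that maps an old training state (model, optimizer, step) to a grown one from which training continues.

```python
import copy
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Toy GPT-like stack of identical blocks, used only to illustrate the interface."""
    def __init__(self, vocab_size=100, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.ModuleList(copy.deepcopy(block) for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)

def grow_depth(state):
    """Hypothetical depth-growth operator: returns a new training state whose
    model has twice as many layers, obtained by duplicating each existing layer.
    The paper's actual operators are constructed so that the loss and the
    training dynamics are preserved; this naive duplication does not guarantee
    either and is only meant to show the state-in / state-out interface."""
    old_model, step = state["model"], state["step"]
    new_model = copy.deepcopy(old_model)
    doubled = []
    for layer in old_model.layers:
        doubled.append(copy.deepcopy(layer))
        doubled.append(copy.deepcopy(layer))
    new_model.layers = nn.ModuleList(doubled)
    # A fresh optimizer over the grown parameters; the paper also maps the
    # optimizer state and the learning-rate schedule, which this sketch omits.
    new_optimizer = torch.optim.AdamW(new_model.parameters(), lr=1e-4)
    return {"model": new_model, "optimizer": new_optimizer, "step": step}

# Usage: start a small model, then apply the growth operator and keep training.
model = TinyTransformerLM()
state = {"model": model,
         "optimizer": torch.optim.AdamW(model.parameters(), lr=1e-4),
         "step": 0}
state = grow_depth(state)

x = torch.randint(0, 100, (2, 8))           # dummy batch of token ids
logits = state["model"](x)                  # grown model is ready to train
print(logits.shape, sum(p.numel() for p in state["model"].parameters()))
```

In the paper's setting, the key design constraints on such an operator are the two properties named in the abstract: the new model's loss should match the old model's loss at the moment of growth, and the slope of the loss curve (the training dynamics) should be preserved so that training continues as if the larger model had been trained all along.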