Paper Title
Distilling Reasoning Capabilities into Smaller Language Models
Paper Authors
Paper Abstract
Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in large language models. However, the success of the CoT approach is fundamentally tied to model size, and billion-parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models. In this work, we propose an alternative reasoning scheme, Socratic CoT, that learns a decomposition of the original problem into a sequence of subproblems and uses it to guide the intermediate reasoning steps. We use Socratic CoT to train a combination of two small distilled models: a problem decomposer and a subproblem solver. In practice, given a new problem, the two distilled models work in sync to decompose and solve complex problems. On multiple reasoning datasets (GSM8K, StrategyQA, and SVAMP), our proposed distillation strategies boost the performance of smaller models by over 70% compared to the baselines. Finally, we investigate when Socratic CoT is an effective alternative to CoT, demonstrating cases where a much smaller model (GPT-2 large) can outperform a 10X larger model (GPT-3 6B). Our code is available here: https://github.com/kumar-shridhar/Distiiling-LM
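The abstract describes an inference-time pipeline in which a distilled problem decomposer proposes subquestions and a distilled subproblem solver answers them in sequence. Below is a minimal sketch of how such a two-model loop could be wired together; the checkpoint paths, prompt formats, and decoding settings are illustrative assumptions and not the authors' released implementation (see the linked repository for the actual code).

```python
# Sketch of the Socratic CoT two-model inference loop described in the abstract.
# Checkpoint paths, prompts, and decoding choices are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model, tokenizer, prompt, max_new_tokens=64):
    """Greedy decoding helper; returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Hypothetical fine-tuned checkpoints: one model decomposes the problem into
# subquestions, the other answers each subquestion in turn.
decomposer_tok = AutoTokenizer.from_pretrained("path/to/distilled-decomposer")
decomposer = AutoModelForCausalLM.from_pretrained("path/to/distilled-decomposer")
solver_tok = AutoTokenizer.from_pretrained("path/to/distilled-solver")
solver = AutoModelForCausalLM.from_pretrained("path/to/distilled-solver")

question = ("A baker made 48 cookies in the morning and half as many in the "
            "afternoon. How many cookies did the baker make in total?")

# Step 1: the problem decomposer proposes a sequence of subquestions.
raw = generate(decomposer, decomposer_tok, f"Question: {question}\nSubquestions:")
subquestions = [s.strip() for s in raw.split("\n") if s.strip()]

# Step 2: the subproblem solver answers each subquestion, conditioning on the
# original question and the intermediate answers produced so far.
context = f"Question: {question}\n"
for sq in subquestions:
    answer = generate(solver, solver_tok, context + f"{sq}\nAnswer:")
    context += f"{sq}\nAnswer: {answer}\n"

print(context)  # the answer to the last subquestion serves as the final answer
```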