Paper Title
Teaching Small Language Models to Reason
Paper Authors
Paper Abstract
Chain of thought prompting successfully improves the reasoning capabilities of large language models, achieving state of the art results on a range of datasets. However, these reasoning capabilities only appear to emerge in models with a size of over 100 billion parameters. In this paper, we explore the transfer of such reasoning capabilities to models with less than 100 billion parameters via knowledge distillation. Specifically, we finetune a student model on the chain of thought outputs generated by a larger teacher model. Our experiments show that the proposed method improves task performance across arithmetic, commonsense and symbolic reasoning datasets. For example, the accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99% when finetuned on PaLM-540B generated chains of thought.
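Below is a minimal sketch of the distillation recipe the abstract describes, assuming a Hugging Face seq2seq student (a small T5 checkpoint stands in for T5 XXL) and a hypothetical teacher_generate helper wrapping the large teacher (e.g. PaLM-540B). The GSM8K loading, prompt handling, and correct-answer filtering are illustrative assumptions, not the paper's exact setup.

from datasets import Dataset, load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

def teacher_generate(question: str) -> str:
    """Placeholder: prompt the large teacher model with few-shot chain-of-thought
    exemplars and return its step-by-step rationale plus final answer."""
    raise NotImplementedError  # hypothetical teacher API, not the paper's code

# 1. Build the distillation corpus: keep only teacher rationales whose final
#    answer matches the gold label (a filtering assumption).
gsm8k = load_dataset("gsm8k", "main", split="train")
pairs = []
for ex in gsm8k:
    gold = ex["answer"].split("####")[-1].strip()   # GSM8K stores the final answer after "####"
    rationale = teacher_generate(ex["question"])
    if rationale.strip().endswith(gold):
        pairs.append({"input": ex["question"], "target": rationale})

# 2. Finetune the student on (question -> chain of thought + answer) pairs.
tok = AutoTokenizer.from_pretrained("t5-small")            # stand-in for T5 XXL
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def tokenize(batch):
    enc = tok(batch["input"], truncation=True, max_length=512)
    enc["labels"] = tok(batch["target"], truncation=True, max_length=512)["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(tokenize, batched=True,
                                        remove_columns=["input", "target"])
trainer = Seq2SeqTrainer(
    model=student,
    args=Seq2SeqTrainingArguments(output_dir="cot-distilled-t5",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=student),
)
trainer.train()

The key design choice sketched here is that the student is trained to emit the full chain of thought followed by the answer, rather than the answer alone, so the teacher's reasoning traces act as the distillation signal.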