Paper Title
Scaling Laws for Autoregressive Generative Modeling
Paper Authors
Paper Abstract
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information-theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e., $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
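The functional forms described in the abstract can be summarized compactly. As a sketch (the notation below is illustrative and follows the abstract's description rather than quoting the paper's exact equations), the loss as a function of model size $N$ follows a power law plus a constant, and the cross-entropy decomposes information-theoretically:

$$L(N) = L_\infty + \left(\frac{N_0}{N}\right)^{\alpha_N}, \qquad L = S(\mathrm{True}) + D_{\mathrm{KL}}(\mathrm{True}\,\|\,\mathrm{Model}),$$

so the irreducible constant $L_\infty$ corresponds to an estimate of the data distribution's entropy $S(\mathrm{True})$, while the decaying power-law term corresponds to the reducible loss $D_{\mathrm{KL}}(\mathrm{True}\,\|\,\mathrm{Model})$.

Below is a minimal sketch of how such a power-law-plus-constant curve could be fit to empirical (model size, loss) measurements. The data points, function names, and initial guesses are hypothetical and for illustration only; this is not the authors' fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(log_n, l_inf, log_n0, alpha):
    """L(N) = L_inf + (N0 / N)**alpha, parameterized in log space for numerical stability."""
    return l_inf + np.exp(alpha * (log_n0 - log_n))

# Synthetic (model size, loss) points drawn from a known curve plus noise;
# these are NOT measurements from the paper.
rng = np.random.default_rng(0)
model_sizes = np.logspace(6, 9, 8)  # 1M to 1B parameters
losses = 2.0 + (3e5 / model_sizes) ** 0.25 + rng.normal(scale=0.01, size=model_sizes.size)

# Fit the three parameters: irreducible loss L_inf, scale N0, and exponent alpha.
(l_inf, log_n0, alpha), _ = curve_fit(
    scaling_law, np.log(model_sizes), losses, p0=[1.0, np.log(1e6), 0.3]
)

print(f"irreducible loss (entropy estimate): {l_inf:.3f} nats")
print(f"reducible loss at N=1e9 (KL estimate): {np.exp(alpha * (log_n0 - np.log(1e9))):.3f} nats")
```

Under this interpretation, extrapolating the fitted curve gives the model size at which the reducible term (the KL estimate) falls below any chosen threshold, which is the kind of forecast the abstract describes for YFCC100M images at various resolutions.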