Paper Title

E-LANG: Energy-Based Joint Inferencing of Super and Swift Language Models

Paper Authors

Mohammad Akbari, Amin Banitalebi-Dehkordi, Yong Zhang

Paper Abstract

Building huge and highly capable language models has been a trend in the past years. Despite their great performance, they incur high computational cost. A common solution is to apply model compression or choose light-weight architectures, which often need a separate fixed-size model for each desirable computational budget, and may lose performance in case of heavy compression. This paper proposes an effective dynamic inference approach, called E-LANG, which distributes the inference between large accurate Super-models and light-weight Swift models. To this end, a decision making module routes the inputs to Super or Swift models based on the energy characteristics of the representations in the latent space. This method is easily adoptable and architecture agnostic. As such, it can be applied to black-box pre-trained models without a need for architectural manipulations, reassembling of modules, or re-training. Unlike existing methods that are only applicable to encoder-only backbones and classification tasks, our method also works for encoder-decoder structures and sequence-to-sequence tasks such as translation. The E-LANG performance is verified through a set of experiments with T5 and BERT backbones on GLUE, SuperGLUE, and WMT. In particular, we outperform T5-11B with an average computations speed-up of 3.3$\times$ on GLUE and 2.9$\times$ on SuperGLUE. We also achieve BERT-based SOTA on GLUE with 3.2$\times$ less computations. Code and demo are available in the supplementary materials.
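As a rough illustration of the routing mechanism described in the abstract, the sketch below scores the Swift model's output with a free-energy function and only falls back to the Super model when that energy is high, i.e., when the Swift prediction looks unreliable. This is a minimal, hypothetical sketch for a single classification example, not the authors' released implementation: `swift_model`, `super_model`, and `energy_threshold` are placeholder names, and the exact energy formulation and routing rule used by E-LANG may differ.

```python
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Free-energy score E(x) = -T * logsumexp(logits / T).

    Lower energy typically corresponds to a more confident prediction.
    (Assumed formulation for illustration; see the paper for the exact one.)
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


@torch.no_grad()
def route_and_predict(x, swift_model, super_model, energy_threshold: float):
    """Run the light-weight Swift model first; fall back to the large Super
    model only when the Swift output's energy exceeds the threshold.

    `swift_model` and `super_model` are placeholder callables that map an
    input to a 1-D logit vector for a single example.
    """
    swift_logits = swift_model(x)  # cheap forward pass
    if energy_score(swift_logits).item() <= energy_threshold:
        return swift_logits.argmax(dim=-1), "swift"
    super_logits = super_model(x)  # expensive forward pass
    return super_logits.argmax(dim=-1), "super"
```

Because the decision only consumes the Swift model's output logits, such a router can in principle wrap black-box pre-trained models without architectural changes or re-training, which is the property the abstract emphasizes; the threshold would be tuned to trade accuracy against the fraction of inputs sent to the Super model.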
