Paper Title
An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
Paper Authors
Paper Abstract
With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.

- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences.
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
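To make point (b) more concrete, below is a minimal numerical sketch (not the paper's construction) of reading softmax attention as a kernel estimator of the conditional mean of the value given the key. The synthetic data model, the squared-distance score (which coincides with dot-product attention when queries and keys share a common norm), and all function names are illustrative assumptions.

```python
# A minimal sketch of point (b), not the paper's construction: softmax attention
# viewed as a Nadaraya-Watson-style kernel estimator of E[value | key].
# The data model, score function, and all names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def attention_estimate(query, keys, values, bandwidth=0.2):
    """Weighted average of values with weights
    softmax(-||query - key_i||^2 / (2 * bandwidth^2)); this matches dot-product
    softmax attention whenever queries and keys share a common norm."""
    sq_dists = np.sum((keys - query) ** 2, axis=1)   # (n,)
    scores = -sq_dists / (2.0 * bandwidth ** 2)
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                          # scalar estimate of E[value | key=query]

def conditional_mean(key):
    # Ground-truth regression function in the synthetic model: value = f(key) + noise.
    return np.sin(3.0 * key[0]) + key[1] ** 2

query = np.array([0.2, -0.4])
for n in (64, 256, 1024, 4096):
    keys = rng.uniform(-1.0, 1.0, size=(n, 2))
    values = np.array([conditional_mean(k) for k in keys]) + 0.1 * rng.standard_normal(n)
    estimate = attention_estimate(query, keys, values)
    error = abs(estimate - conditional_mean(query))
    print(f"n={n:5d}  |attention estimate - E[value | key]| = {error:.4f}")
```

Running the loop for increasing n typically shows the gap to the true conditional mean shrinking, mirroring the abstract's claim that the approximation error decreases in the input size; with a fixed bandwidth the error eventually plateaus at the smoothing bias.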