Paper Title
Simplifying and Understanding State Space Models with Diagonal Linear RNNs
Paper Authors
Paper Abstract
Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long-range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously proposed SSMs on a variety of tasks and benchmarks, including Long Range Arena and raw speech classification. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels, especially when the desired sequence manipulation is $\textit{context-dependent}$. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks, $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$, with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$, for which attention is not a viable choice.
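For concreteness, below is a minimal sketch of a single diagonal linear RNN channel of the kind the abstract describes, with no discretization step. The specific parameterization (a complex diagonal recurrence, inputs broadcast directly into the state, and a learned read-out $w$) is an assumption for illustration and may differ in detail from the paper's exact setup; the names `dlr_layer` and `dlr_kernel` are hypothetical.

```python
import numpy as np

def dlr_layer(u, lam, w):
    """Apply one diagonal linear RNN channel to a real 1-D input sequence.

    u   : (L,) real input sequence
    lam : (N,) complex diagonal recurrence weights (|lam| <= 1 for stability)
    w   : (N,) complex read-out vector

    Recurrence: x_k = lam * x_{k-1} + u_k,  y_k = Re(<w, x_k>), with x_0 = 0.
    """
    L, N = len(u), len(lam)
    x = np.zeros(N, dtype=complex)
    y = np.empty(L)
    for k in range(L):
        x = lam * x + u[k]        # element-wise: the state matrix is diagonal
        y[k] = np.real(w @ x)
    return y

def dlr_kernel(lam, w, L):
    """The same map written as an explicit causal convolution kernel,
    K_j = Re(sum_i w_i * lam_i**j) for j = 0..L-1."""
    j = np.arange(L)
    return np.real((lam[None, :] ** j[:, None]) @ w)

# Sanity check: the recurrence and the convolution view agree.
rng = np.random.default_rng(0)
u = rng.standard_normal(32)
lam = 0.9 * np.exp(2j * np.pi * rng.random(4))
w = rng.standard_normal(4) + 0j
assert np.allclose(dlr_layer(u, lam, w),
                   np.convolve(u, dlr_kernel(lam, w, 32))[:32])
```

The kernel view is what makes the abstract's expressivity framing concrete: each channel realizes one long causal convolutional kernel, so "few" versus "many" kernels directly counts the channels a task demands, and a fixed set of kernels cannot express a context-dependent sequence manipulation.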