Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts and evaluating on longer sequences. We define attention resolution as an indicator of extrapolation ability. We then propose two designs to improve this metric for Transformers. Specifically, we introduce a relative position embedding that explicitly maximizes attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants on language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
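As a rough illustration of the blockwise causal attention mentioned above, the following NumPy sketch builds a boolean attention mask in which each query attends causally to keys in its own block and the immediately preceding one. The block size and the exact masking rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def blockwise_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask (True = attend) for a blockwise causal pattern.

    Each query attends causally, but only to keys in its own block or the
    immediately preceding block, so the span seen at inference time stays
    close to the training length.
    """
    q_blocks = np.arange(seq_len) // block_size  # block index of each query
    k_blocks = np.arange(seq_len) // block_size  # block index of each key
    causal = np.arange(seq_len)[:, None] >= np.arange(seq_len)[None, :]
    same_or_prev = (q_blocks[:, None] - k_blocks[None, :]) <= 1
    return causal & same_or_prev

# Example: 8 tokens, blocks of 4 -> queries in block 1 see blocks 0 and 1 causally.
print(blockwise_causal_mask(8, 4).astype(int))
```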
Position encoding has recently been shown to be effective in the Transformer architecture. It provides valuable supervision for modeling dependencies between elements at different positions of a sequence. In this paper, we first investigate various methods of integrating positional information into the learning process of Transformer-based language models. We then propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information. Specifically, the proposed RoPE encodes absolute positions with a rotation matrix while incorporating explicit relative position dependencies into the self-attention formulation. Notably, RoPE has valuable properties, including flexibility with respect to sequence length, inter-token dependencies that decay with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. Finally, we evaluate the enhanced Transformer with rotary position embedding, also called RoFormer, on various long-text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some of the experimental results. RoFormer has already been integrated into HuggingFace: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.
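The rotation-matrix encoding can be illustrated with a minimal NumPy sketch: pairs of feature dimensions of the queries and keys are rotated by a position-dependent angle before the dot product, so the resulting attention score depends only on the relative position. Shapes, the base of 10000, and the helper name `apply_rope` are illustrative; this is a sketch of the idea, not the RoFormer implementation.

```python
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate pairs of feature dimensions by a position-dependent angle.

    x: array of shape (seq_len, dim) with dim even. For position m and
    frequency theta_i, the pair (x_{2i}, x_{2i+1}) is rotated by m * theta_i,
    which makes the query-key dot product depend only on relative position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) / half)              # theta_i = base^(-2i/dim)
    angles = np.arange(seq_len)[:, None] * theta[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the attention dot product:
q, k = np.random.randn(16, 64), np.random.randn(16, 64)
scores = apply_rope(q) @ apply_rope(k).T
```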
We revisit design choices in Transformers and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named the gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of strong Transformer baselines over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.
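A gated attention unit of this kind can be sketched roughly as below: a simplified, single-head NumPy version that omits normalization, the relative position bias, and the chunked linear approximation. All projection names and shapes are assumptions for illustration, not the FLASH code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_attention_unit(x, Wu, Wv, Wz, Wo, gamma_q, beta_q, gamma_k, beta_k):
    """Single-head gated attention unit (simplified sketch).

    x: (n, d). U and V are expanded gate/value projections, Z is a small
    shared projection from which queries and keys are derived by cheap
    per-dimension scales and offsets. Attention uses squared ReLU instead
    of softmax, and the output is gated elementwise by U.
    """
    n = x.shape[0]
    u = silu(x @ Wu)                       # (n, e) gate
    v = silu(x @ Wv)                       # (n, e) value
    z = silu(x @ Wz)                       # (n, s) shared basis for q and k
    q = z * gamma_q + beta_q               # (n, s)
    k = z * gamma_k + beta_k               # (n, s)
    a = np.maximum(q @ k.T / n, 0.0) ** 2  # (n, n) squared-ReLU attention
    return (u * (a @ v)) @ Wo              # (n, d)

# Illustrative shapes: d=64 model width, e=128 expanded width, s=32 shared dim.
d, e, s, n = 64, 128, 32, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
out = gated_attention_unit(
    x, rng.standard_normal((d, e)), rng.standard_normal((d, e)),
    rng.standard_normal((d, s)), rng.standard_normal((e, d)),
    np.ones(s), np.zeros(s), np.ones(s), np.zeros(s))
```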
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. Existing attention variants improve computational efficiency, but their ability to compute global information effectively is limited. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r. Specifically, we augment the bottom layer of SPADE with an SSM and employ efficient local attention methods for the other layers. The SSM provides global information, which complements the lack of long-range dependency modeling in local attention methods. Experimental results on the Long Range Arena benchmark and on language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.
State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work, we focus on autoregressive sequence modeling over English books, GitHub source code, and ArXiv mathematics articles. Building on recent developments around the effectiveness of gated activation functions, we propose a new layer named the Gated State Space (GSS) layer and show that it trains significantly faster than the diagonal version of S4 (i.e., DSS) on TPUs, is fairly competitive with well-tuned Transformer-based baselines, and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
Many NLP tasks require processing long contexts beyond the length limit of pretrained models. To scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research in this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., when applying them under the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale, controlled experiments. For each attention variant, we pretrain models on the same long-document corpus and then finetune them for real-world long-context tasks. Our findings reveal pitfalls of an existing, widely used long-range benchmark and show that none of the tested efficient attention variants beats a simple local window attention under the standard pretraining paradigm. Further analysis of local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results: using disjoint local attention, we are able to build a simpler and more efficient long-document QA model that matches the performance of Longformer~\citep{longformer} with half of its pretraining cost.
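The disjoint local attention referred to above amounts to a block-diagonal attention mask; the following NumPy sketch (window size illustrative) shows the idea.

```python
import numpy as np

def disjoint_local_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask (True = attend) for non-overlapping local attention.

    The sequence is cut into disjoint windows and every token attends only
    to tokens inside its own window; no sliding or overlapping context.
    """
    blocks = np.arange(seq_len) // window
    return blocks[:, None] == blocks[None, :]

# 6 tokens with window 3: two independent 3x3 attention blocks.
print(disjoint_local_mask(6, 3).astype(int))
```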
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture that uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64$\times$64 ImageNet images and PG-19 books.
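A minimal sketch of the core idea, assuming standard scaled dot-product attention: queries are drawn only from the final positions of the sequence, while keys and values cover the whole input, and causality is kept by comparing absolute positions. Projection names and shapes are illustrative, and the latent self-attention stack that follows in Perceiver AR is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_ar_cross_attention(x, num_latents, Wq, Wk, Wv):
    """Map a long input to a few latents with causally masked cross-attention.

    x: (n, d). Queries come only from the last `num_latents` positions, while
    keys/values come from the whole sequence; latent i (at absolute position
    n - num_latents + i) may attend only to positions <= its own.
    """
    n, d = x.shape
    q = x[n - num_latents:] @ Wq             # (m, dk)
    k = x @ Wk                               # (n, dk)
    v = x @ Wv                               # (n, dv)
    q_pos = np.arange(n - num_latents, n)    # absolute positions of the latents
    mask = q_pos[:, None] >= np.arange(n)[None, :]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v               # (m, dv): one latent per target token

# Illustrative: 1024 input tokens compressed to 128 latents.
rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 64))
latents = perceiver_ar_cross_attention(
    x, 128, *(rng.standard_normal((64, 64)) for _ in range(3)))
```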
The quadratic computational and memory complexity of the Transformer attention mechanism limits its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. The packed sequence is then unpacked using the second attention function. Compared to a more traditional attention mechanism, Luna introduces an additional fixed-length sequence as input and an additional corresponding output, which allows Luna to perform the attention operation linearly while still storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate the effectiveness and efficiency of Luna compared to a variety of strong baselines.
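The pack-and-unpack structure can be sketched as two plain attention calls; the specific attention functions, projections, and causal handling used by Luna are omitted, and `luna_layer` together with all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def luna_layer(x, p):
    """Nested pack/unpack attention (simplified sketch, no projections).

    x: (n, d) input sequence; p: (l, d) extra fixed-length sequence, l << n.
    Pack: p attends over x, producing p_packed of length l.
    Unpack: x attends over p_packed, so the cost is O(n * l) instead of O(n^2).
    """
    p_packed = attend(p, x, x)         # (l, d): fixed-length summary of the input
    y = attend(x, p_packed, p_packed)  # (n, d): unpacked back to full length
    return y, p_packed                 # p_packed is carried to the next layer

rng = np.random.default_rng(0)
x, p = rng.standard_normal((512, 64)), rng.standard_normal((16, 64))
y, p_next = luna_layer(x, p)
```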
Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time, or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories (local, long-term, and external memory) at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it achieves significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103 by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
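A simplified sketch of a memory-augmented objective of this kind is given below, assuming unnormalized dot-product similarities and treating every other in-batch position as memory; the exact similarity function, memory construction, and batching strategies of TRIME are not reproduced.

```python
import numpy as np

def in_batch_memory_loss(q, targets, word_emb, temp=1.0):
    """In-batch memory-augmented LM objective (simplified sketch).

    q: (n, d) contextualized representations of the prediction positions in a
    batch; targets: (n,) their gold next tokens; word_emb: (V, d) output
    embeddings. Each example's memory is the set of *other* in-batch
    representations; the score of the gold token sums its output-embedding
    logit with similarities to memory entries whose target is the same token.
    """
    n, d = q.shape
    logits_vocab = q @ word_emb.T / temp           # (n, V)
    sims = q @ q.T / temp                          # (n, n) in-batch similarities
    np.fill_diagonal(sims, -np.inf)                # exclude self from memory
    gold = np.exp(logits_vocab[np.arange(n), targets])
    same_target = targets[None, :] == targets[:, None]
    gold += np.where(same_target, np.exp(sims), 0.0).sum(axis=1)
    denom = np.exp(logits_vocab).sum(axis=1) + np.exp(sims).sum(axis=1)
    return -np.mean(np.log(gold / denom))

rng = np.random.default_rng(0)
loss = in_batch_memory_loss(rng.standard_normal((32, 16)),
                            rng.integers(0, 50, size=32),
                            rng.standard_normal((50, 16)))
```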
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
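The disentangled attention score described above can be sketched as the sum of content-to-content, content-to-position, and position-to-content terms over a clipped relative-distance bucket. This simplified NumPy version (single head, illustrative projection names, no masking or softmax) is an assumption-laden sketch rather than the DeBERTa code.

```python
import numpy as np

def disentangled_scores(h, rel_emb, Wq, Wk, Wqr, Wkr, max_rel):
    """Disentangled attention scores (simplified sketch).

    h: (n, d) content vectors; rel_emb: (2*max_rel, d) relative-position
    embedding table. The score of query i on key j is the sum of a
    content-to-content term, a content-to-position term, and a
    position-to-content term over the clipped relative distance.
    """
    n, d = h.shape
    qc, kc = h @ Wq, h @ Wk                # content projections
    qr, kr = rel_emb @ Wqr, rel_emb @ Wkr  # position projections
    # clipped relative-distance bucket in [0, 2*max_rel)
    delta = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :],
                    -max_rel, max_rel - 1) + max_rel
    c2c = qc @ kc.T                                        # content -> content
    c2p = np.take_along_axis(qc @ kr.T, delta, axis=1)     # content -> position
    p2c = np.take_along_axis(kc @ qr.T, delta, axis=1).T   # position -> content
    return (c2c + c2p + p2c) / np.sqrt(3 * d)

rng = np.random.default_rng(0)
n, d, max_rel = 8, 16, 4
scores = disentangled_scores(rng.standard_normal((n, d)),
                             rng.standard_normal((2 * max_rel, d)),
                             *(rng.standard_normal((d, d)) for _ in range(4)),
                             max_rel)
```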
Length extrapolation is a desirable property that permits training a transformer language model on short sequences and retaining similar perplexities when the model is tested on substantially longer sequences. A relative positional embedding mechanism applied on the transformer self-attention matrix, ALiBi, demonstrates the length extrapolation property with the widest usage to date. In this paper, we show that ALiBi surprisingly does not utilize tokens further than the training sequence length, which can be explained by its implicit windowed attention effect that aligns the receptive field during training and testing stages. Inspired by ALiBi and the receptive field alignment hypothesis, we propose another transformer positional embedding design named \textbf{Sandwich} that uses information from tokens beyond the training sequence length, and it is a greatly simplified formulation of the earliest proposed sinusoidal positional embedding. Finally, we show that both ALiBi and Sandwich enable efficient inference thanks to their implicit windowed attention effect.
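For reference, the implicit windowing effect comes from distance-dependent penalties added to the attention logits. A small NumPy sketch of ALiBi-style linear biases is shown below, assuming the number of heads is a power of two and using the commonly cited geometric slope schedule; the Sandwich formulation itself is not reproduced here.

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Per-head linear attention biases (a sketch of the ALiBi scheme).

    Head h adds -slope_h * (i - j) to the score of query i on key j (j <= i),
    with slopes forming a geometric sequence; distant keys are penalized more,
    which in effect windows the attention.
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]  # i - j
    dist = np.tril(dist)                              # causal part only
    return -slopes[:, None, None] * dist[None, :, :]  # (heads, seq, seq)

# Added to the attention logits before the causal mask and softmax:
bias = alibi_bias(seq_len=8, num_heads=8)
```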
In recent years, Transformer-based pre-trained models have achieved great progress and become one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformers may not be necessary, and convolutional neural networks and multi-layer perceptron based models have also been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communication, together with a sentence-level representation that is decoupled from the other tokens. The original model performs well on domain-specific text classification under supervised training; however, its potential for learning transferable knowledge via self-supervised learning has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness on more general language understanding tasks, in both English and Chinese. As for model efficiency, instead of the quadratic complexity of Transformer-based models, our model has linear complexity and is more efficient during inference. Moreover, we find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple: it is merely a transformer layer that uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude. Our implementation of recurrence has the same cost in both computation time and parameter count as a conventional transformer layer, but offers dramatically improved perplexity in language modeling tasks over very long sequences. Our model outperforms a long-range Transformer-XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG-19 (books), arXiv papers, and GitHub source code. Our code has been released as open source.
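A heavily simplified sketch of one recurrent step follows, assuming plain softmax attention helpers and a single sigmoid gate in place of the full LSTM-style gating; block width, state size, and all names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_recurrent_step(x, state, Wg):
    """One recurrent step over a block of tokens (heavily simplified sketch).

    x: (w, d) current block; state: (k, d) recurrent state vectors.
    Tokens cross-attend to the state; the state cross-attends to the tokens
    and is updated through a gated, LSTM-flavored combination.
    """
    h = x + attend(x, state, state)             # tokens read from the state
    update = attend(state, x, x)                # state reads from the tokens
    g = sigmoid(np.concatenate([state, update], axis=-1) @ Wg)  # (k, d) gate
    new_state = g * state + (1.0 - g) * update  # gated state update
    return h, new_state

rng = np.random.default_rng(0)
w, k, d = 8, 4, 16
x, state = rng.standard_normal((w, d)), rng.standard_normal((k, d))
h, state = block_recurrent_step(x, state, rng.standard_normal((2 * d, d)))
```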
The attention module, a crucial component of Transformers, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used as the default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention computation for Transformers with RPE on top of kernelized attention. Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n \log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.
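The key computational primitive, multiplying a Toeplitz matrix of relative-position coefficients by a matrix of values in $\mathcal{O}(n \log n)$ time, can be sketched with a circulant embedding and the FFT; the sketch below checks itself against the dense product. How these products enter the kernelized attention is not shown, and the coefficient layout is an assumption for illustration.

```python
import numpy as np

def toeplitz_matmul_fft(c, v):
    """Multiply a Toeplitz matrix by a matrix via FFT in O(n log n) per column.

    c: (2n-1,) relative-position coefficients, where c[k] is the entry for
    relative offset k - (n-1) in [-(n-1), n-1], so T[i, j] = c[i - j + n - 1].
    v: (n, d) values. The Toeplitz matrix is embedded into a circulant one of
    size 2n-1, whose product is a circular convolution computable with FFT.
    """
    n, d = v.shape
    # first column of the circulant: offsets 0, 1, ..., n-1, then -(n-1), ..., -1
    col = np.concatenate([c[n - 1:], c[:n - 1]])
    v_pad = np.concatenate([v, np.zeros((n - 1, d))], axis=0)
    prod = np.fft.ifft(np.fft.fft(col)[:, None] * np.fft.fft(v_pad, axis=0), axis=0)
    return prod[:n].real

# Check against the dense Toeplitz product on a small example.
n, d = 6, 3
rng = np.random.default_rng(0)
c, v = rng.standard_normal(2 * n - 1), rng.standard_normal((n, d))
T = np.array([[c[i - j + n - 1] for j in range(n)] for i in range(n)])
assert np.allclose(T @ v, toeplitz_matmul_fft(c, v))
```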
Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is the self-attention mechanism, which captures the interactions of token pairs in the input sequence and depends quadratically on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH) decreases the quadratic complexity of such models to linear. We bypass the quadratic cost by considering self-attention as a sum over individual tokens associated with Bernoulli random variables that can, in principle, be sampled at once with a single hash (although in practice this number may be a small constant). This leads to an efficient sampling scheme for estimating self-attention that relies on specific modifications of LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with the standard 512 sequence length, where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, designed to evaluate performance on long sequences, our method achieves results consistent with softmax self-attention but with considerable speedups and memory savings, and it often outperforms other efficient self-attention methods. Our code is available at https://github.com/mlpen/yoso.
Design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application to modeling long sequences. In this paper, we introduce a simple, theoretically grounded, single-head gated attention mechanism equipped with an (exponential) moving average to incorporate the inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega with linear time and space complexity that incurs only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks of fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
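The (damped) exponential moving average component can be sketched as a per-dimension recurrence over the sequence; the multi-dimensional expansion, gating, and attention of Mega are omitted, and the decay values below are illustrative.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped exponential moving average along the sequence (simplified sketch).

    x: (n, d); alpha, delta: (d,) per-dimension decay/damping in (0, 1).
    y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, computed per dimension,
    which injects a position-aware, local-recency inductive bias before the
    single-head gated attention.
    """
    n, d = x.shape
    y = np.zeros_like(x)
    prev = np.zeros(d)
    for t in range(n):
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
smoothed = damped_ema(x, alpha=np.full(8, 0.3), delta=np.full(8, 0.9))
```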
The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the question of whether we can append [MASK]s at a later layer, to reduce the sequence length for earlier layers and make the pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) The gathering of contextualized information from unmasked tokens can be conducted with a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with only 78% and 68% of the original computational budget without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.
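A toy sketch of the late [MASK] insertion, with stand-in layers in place of real transformer blocks: the unmasked tokens form a shorter sequence for the early layers, and [MASK] vectors tagged with their positions are appended only afterwards. All shapes, the stand-in `layer`, and the split point `num_early` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_layers, num_early = 16, 100, 4, 2
tok_emb = rng.standard_normal((vocab, d))
pos_emb = rng.standard_normal((64, d))
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
mask_vec = rng.standard_normal(d)

def layer(h, W):  # stand-in for a transformer layer
    return np.tanh(h @ W) + h

def late_mask_forward(tokens, mask_positions):
    """Hide masked positions from the first `num_early` layers (sketch)."""
    keep = [i for i in range(len(tokens)) if i not in mask_positions]
    h = tok_emb[np.array(tokens)[keep]] + pos_emb[keep]  # shorter early sequence
    for W in Ws[:num_early]:
        h = layer(h, W)
    masks = mask_vec[None, :] + pos_emb[sorted(mask_positions)]  # [MASK]s join late
    h = np.concatenate([h, masks], axis=0)
    for W in Ws[num_early:]:
        h = layer(h, W)
    return h[len(keep):]  # hidden states at the [MASK] slots, used for prediction

states = late_mask_forward(tokens=[5, 7, 11, 13, 17, 19], mask_positions={1, 4})
```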
The transformer architecture is now central to sequence modeling tasks. At its heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied in the computer vision domain, where 2D images are first segmented into patches and then treated as 1D sequences. Such linearization, however, impairs the notion of spatial locality in images, which bears important visual clues. To bridge the gap, we propose ripple attention, a sub-quadratic attention mechanism for vision transformers. Built upon recent kernel-based efficient attention mechanisms, we design a novel dynamic programming algorithm that weights the contributions of different tokens to a query with respect to their relative spatial distances in the 2D space. Extensive experiments and analyses demonstrate the effectiveness of ripple attention on various visual tasks.
In this paper, we attempt to build a connection between the two schools by introducing syntactic inductive biases into deep learning models. We propose two families of inductive biases, one for constituency structure and the other for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation provides a way for deep learning models to build latent hierarchical representations from sequential inputs, where a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing the representations of variables and operators according to their syntactic structure. The dependency inductive bias, on the other hand, encourages models to find latent relations between entities in the input sequence. For natural language, the latent relations are usually modeled as a directed dependency graph, in which a word has exactly one parent node and zero or more child nodes. After applying this constraint to a Transformer-like model, we find that the model is capable of inducing directed graphs that are close to human expert annotations, and it also outperforms the standard transformer model on different tasks. We believe these experimental results demonstrate an interesting alternative for the future development of deep learning models.