Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available.
Translated by Google Translate
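Several recent studies have demonstrated that attention-based networks, such as the Vision Transformer (ViT), can outperform convolutional neural networks (CNNs) on several computer vision tasks without using convolutional layers. This naturally leads to the following question: can a self-attention layer of ViT express any convolution operation? In this work, we prove constructively that a single ViT layer with image patches as input can perform any convolution operation, where the multi-head attention mechanism and the relative positional encoding play essential roles. We further provide a lower bound on the number of heads needed for Vision Transformers to express CNNs. Consistent with our analysis, experimental results show that the construction in our proof can help inject convolutional bias into Transformers and significantly improve the performance of ViT in low-data regimes.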
Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pretrained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al., 2020) on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit.
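The gating idea described above can be illustrated with a minimal NumPy sketch. It is only one plausible reading of the abstract, not the released ConViT code: the function names, the sigmoid form of the gate, the convex blend of two softmax maps, and the quadratic 1D positional score are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_position_scores(n, offset, sharpness):
    """Hypothetical positional scores for n tokens on a 1D grid.

    Row i peaks at column i + offset, which mimics the locality of one
    tap of a convolution kernel. Larger sharpness means a narrower peak.
    """
    idx = np.arange(n)
    rel = idx[None, :] - idx[:, None]      # rel[i, j] = j - i
    return -sharpness * (rel - offset) ** 2

def gated_attention(content_scores, position_scores, gate):
    """Blend content and positional attention with a scalar gate (assumed form).

    sigmoid(gate) near 1 -> the head behaves locally (position only);
    sigmoid(gate) near 0 -> the head attends by content only.
    """
    g = 1.0 / (1.0 + np.exp(-gate))
    return (1.0 - g) * softmax(content_scores) + g * softmax(position_scores)
```

Because the result is a convex combination of two row-stochastic matrices, each row still sums to one, so a head can move smoothly between convolution-like and content-based behaviour as its gate is trained.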
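Vision Transformers (ViTs) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models apply self-attention locally on non-interleaving windows. This relaxation reduces the complexity with respect to the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow a fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while being faster than existing methods.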
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
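Convolution and self-attention are two powerful techniques for representation learning, and they are usually considered two peer approaches that are distinct from each other. In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of the computation in these two paradigms is in fact done with the same operation. Specifically, we first show that a traditional convolution with kernel size k x k can be decomposed into k^2 individual 1x1 convolutions, followed by shift and summation operations. Then, we interpret the projections of queries, keys, and values in the self-attention module as multiple 1x1 convolutions, followed by the computation of attention weights and the aggregation of the values. Therefore, the first stage of both modules comprises a similar operation. More importantly, the first stage contributes the dominant computational complexity (the square of the channel size) compared to the second stage. This observation naturally leads to an elegant integration of these two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefits of both self-attention and convolution (ACmix), while having minimal computational overhead compared to its pure convolution or self-attention counterpart. Extensive experiments show that our model achieves consistently improved results over competitive baselines on image recognition and downstream tasks. Code and pre-trained models will be released at https://github.com/panxuran/acmix and https://gitee.com/mindspore/models.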
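The decomposition claimed above (a k x k convolution equals k^2 separate 1x1 convolutions followed by shifts and a sum) can be checked numerically. The following NumPy sketch is an illustration written for this listing, not code from the ACmix repository; the function names, the zero-padding convention, stride 1, and the odd kernel size are assumptions.

```python
import numpy as np

def conv2d_direct(x, w):
    """Reference k x k convolution (cross-correlation), zero padding, stride 1.

    x: (C_in, H, W) input; w: (C_out, C_in, k, k) kernel with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]              # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def shift2d(f, dy, dx):
    """Return g with g[:, i, j] = f[:, i + dy, j + dx], zero outside the map."""
    _, h, w = f.shape
    g = np.zeros_like(f)
    i0, i1 = max(0, -dy), min(h, h - dy)
    j0, j1 = max(0, -dx), min(w, w - dx)
    if i1 <= i0 or j1 <= j0:                             # shift leaves the map
        return g
    g[:, i0:i1, j0:j1] = f[:, i0 + dy:i1 + dy, j0 + dx:j1 + dx]
    return g

def conv2d_decomposed(x, w):
    """Same convolution as k^2 1x1 convolutions, each shifted and summed."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    out = np.zeros((c_out,) + x.shape[1:])
    for p in range(k):
        for q in range(k):
            # Stage 1: 1x1 convolution using the kernel slice at offset (p, q).
            f = np.einsum('oc,chw->ohw', w[:, :, p, q], x)
            # Stage 2: shift by the kernel offset and accumulate.
            out += shift2d(f, p - pad, q - pad)
    return out
```

In this form the channel-mixing cost sits entirely in stage 1 (the `einsum`), which is the stage the abstract identifies as shared with the query, key, and value projections of self-attention.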
Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention. We therefore propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention. Extensive experiments show that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile constrained network, while keeping the number of parameters similar. In particular, our method achieves a 1.3% top-1 accuracy improvement on ImageNet classification over a ResNet50 baseline and outperforms other attention mechanisms for images such as . It also achieves an improvement of 1.4 mAP in COCO Object Detection on top of a RetinaNet baseline.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT.
Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to a ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox. Code for this project is made available. * Denotes equal contribution. Ordering determined by random shuffle. † Work done as a member of the Google AI Residency Program.
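Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks and have become the main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, so their discriminative power is restrained. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in the classical relative positional encoding (RPE) to efficiently encode the cross-token relations for token mixing. It can successfully reduce the current quadratic parameter complexity $O(n^2)$ of vision MLPs to $O(n)$ and $O(1)$. We experiment with two RPE mechanisms, and further propose a group-wise extension that captures multi-granular contexts to improve their expressive power. These then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach by conducting thorough experiments, demonstrating improved or comparable performance with reduced parameter complexity. For instance, for a model trained on ImageNet1K, we achieve a performance improvement from 72.14% to [...] and a reduction in learnable parameters from 194M to 118.2M. Code can be found at https://github.com/zhicaiwww/posmlp.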
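Efficient custom pooling techniques that can aggressively trim the dimensions of a feature map, and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications, have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement for standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential softmax. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With aggressive down-sampling of the activation maps in the initial layers (providing up to a 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy compared to SOTA techniques with iso-memory footprints. This enables the deployment of our models on memory-constrained devices, such as micro-controllers (without losing significant accuracy), because the initial activation maps consume a significant amount of on-chip memory for the high-resolution images required by complex vision tasks. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.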
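We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning, both in theory and in practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing graph neural networks (GNNs). When trained on a large-scale graph dataset (PCQM4Mv2), our method, coined Tokenized Graph Transformer (TokenGT), achieves significantly better results than GNN baselines and competitive results compared to models with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.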
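Transformer models are permutation equivariant. To supply the order and type information of the input tokens, position and segment embeddings are usually added to the input. Recent works have proposed variations of positional encodings, with relative position encodings achieving better performance. Our analysis shows that the gain actually comes from moving positional information from the input to the attention layer. Motivated by this, we introduce Decoupled Positional Attention for Transformers (DIET), a simple yet effective mechanism to encode position and segment information into Transformer models. The proposed method has faster training and inference time, while achieving competitive performance on the GLUE, XTREME, and WMT benchmarks. We further generalize our method to long-range transformers and show performance gains.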
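The permutation-equivariance property stated above, and the effect of supplying position at the attention layer rather than at the input, can be made concrete with a small NumPy sketch. This is an illustrative single-head layer, not the DIET implementation; the function name `attention`, the additive `pos_bias` term, and the 1/sqrt(d) scaling are assumptions.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, wq, wk, wv, pos_bias=None):
    """Single-head self-attention over tokens x of shape (n, d).

    pos_bias, if given, is an (n, n) matrix added to the attention logits,
    so positional information enters at the attention layer (assumed form).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if pos_bias is not None:
        scores = scores + pos_bias
    return softmax(scores) @ v
```

Without `pos_bias`, permuting the input rows simply permutes the output rows, so the layer cannot distinguish token order; with a position-dependent bias on the logits, that symmetry is broken without touching the input embeddings.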
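Recent work has shown the potential of transformers for computer vision applications. An image is first partitioned into patches, which are then used as input tokens for the attention mechanism. Due to the expensive quadratic cost of the attention mechanism, either a large patch size is used, resulting in coarse-grained global interactions, or, alternatively, attention is applied only on a local region of the image, at the expense of long-range interactions. In this work, we propose an approach that allows for both coarse global interactions and fine-grained local interactions already at the early layers of a vision transformer. At the core of our method is the application of local and global attention layers. In the local attention layer, we apply attention to each patch and its local shifts, resulting in virtually located local patches, which are not bound to a single, specific location. These virtually located patches are then used in a global attention layer. The separation of the attention layer into local and global counterparts allows for a low computational cost in the number of patches, while still supporting data-dependent localization already at the first layer, as opposed to the static positioning in other visual transformers. Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet. Code is available at: https://github.com/shellysheynin/locally-sag-transformer.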
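Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context at varying scales. We show that ContextPool makes attention models more expressive, often achieving strong performance with fewer layers and thus at significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable for efficient feature learning.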
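We present Neighborhood Attention Transformer (NAT), an efficient, accurate, and scalable hierarchical transformer that works well on both image classification and downstream vision tasks. It is built upon Neighborhood Attention (NA), a simple and flexible attention mechanism that localizes the receptive field of each query to its nearest neighboring pixels. NA is a localization of self-attention, and approaches it as the receptive field size increases. It is also equivalent in FLOPs and memory usage to Swin Transformer's shifted-window attention given the same receptive field size, while being less constrained. Furthermore, NA includes local inductive biases, which eliminate the need for extra operations such as pixel shifts. Experimental results on NAT are competitive; NAT-Tiny, with only 4.3 GFLOPs and 28M parameters on ImageNet, reaches 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K. We open-source our checkpoints, code, and CUDA kernels at: https://github.com/shi-labs/neighborhood-attention-transformer.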
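The behaviour described above, where each query attends only to its nearest neighbours and the mechanism approaches full self-attention as the neighbourhood grows, can be sketched in one dimension with NumPy. This is a simplified illustration rather than the NAT CUDA kernel: the 1D setting, the function names, the 1/sqrt(d) scaling, and the choice to clamp windows at the borders so every query sees exactly `window` tokens are assumptions.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis with max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(q, k, v):
    """Full scaled dot-product self-attention over n tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def neighborhood_attention_1d(q, k, v, window):
    """Each query attends to a contiguous window of its nearest tokens.

    q, k, v: arrays of shape (n, d). Near the borders the window is
    clamped to stay inside the sequence, so every query still attends
    to exactly `window` tokens.
    """
    n, d = q.shape
    assert 1 <= window <= n
    out = np.empty_like(v)
    for i in range(n):
        start = min(max(i - window // 2, 0), n - window)
        idx = slice(start, start + window)
        scores = q[i] @ k[idx].T / np.sqrt(d)
        out[i] = softmax(scores) @ v[idx]
    return out
```

With `window` equal to the sequence length, every window covers all tokens and the output coincides with `self_attention`; with `window=1`, each token attends only to itself and the layer returns `v` unchanged.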
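Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that, for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute, while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than baseline transformers with 8 heads.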
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. * Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. † Work performed while at Google Brain. ‡ Work performed while at Google Research.
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.