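Recently, vision Transformers (ViTs) have been developing rapidly and are starting to challenge the dominance of convolutional neural networks (CNNs) in computer vision (CV). With a general-purpose Transformer architecture replacing the hard-coded inductive biases of convolution, ViTs have surpassed CNNs, especially when data is abundant. However, ViTs are prone to overfitting on small datasets and therefore rely on large-scale pre-training, which takes an enormous amount of time. In this paper, we strive to liberate ViTs from pre-training by introducing CNNs' inductive biases back into ViTs, while preserving their network architecture for a higher upper bound and setting up more suitable optimization objectives. First, an agent CNN with inductive biases is designed based on the given ViT. Then a bootstrapping training algorithm is proposed to jointly optimize the agent and the ViT with weight sharing, during which the ViT learns inductive biases from the agent's intermediate features. Extensive experiments on CIFAR-10/100 and ImageNet-1k with limited training data show encouraging results: the inductive biases help ViTs converge faster, even with fewer parameters.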
translated by Google Translate
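With the rise of Transformers as the standard for language processing, and their advances in computer vision, parameter counts and the amount of training data have grown correspondingly. Many have come to believe that, because of this, Transformers are not suitable for small amounts of data. This trend raises concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that, with the right size and convolutional tokenization, Transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size and can have as few as 0.28M parameters while achieving competitive results. Our best model reaches 98% accuracy when trained from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data efficiency over previous Transformer-based models: it is over 10x smaller than other Transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches, and even some NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as on NLP tasks. Our simple and compact design for Transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data-efficient Transformers. Our code and pre-trained models are publicly available at https://github.com/shi-labs/compact-transformers.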
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, or not at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS token-based and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred. With these findings, we achieve significant fine-tuning accuracy improvements over MIM pre-training from scratch on ImageNet-1K classification with the ViT-Tiny, ViT-Small, and ViT-Base models, with gains of +4.2%/+2.4%/+1.4%, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
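To make the idea of distilling token relations concrete, the following is a minimal sketch and not the released TinyMIM code. It assumes that a relation is the softmax-normalized scaled dot product between query and key features, and that the student matches the teacher's relations under a KL divergence; the function names and the toy inputs are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def token_relations(queries, keys):
    # Row i holds the normalized similarity of token i's query to every key.
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def relation_distillation_loss(teacher_rel, student_rel):
    # Mean KL(teacher || student) over tokens (assumed loss choice).
    total = 0.0
    for t_row, s_row in zip(teacher_rel, student_rel):
        total += sum(t * math.log(t / s) for t, s in zip(t_row, s_row))
    return total / len(teacher_rel)

# Toy features for three tokens (hypothetical values).
teacher_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
student_q = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.6]]
student_k = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.4]]

teacher_rel = token_relations(teacher_q, teacher_k)
student_rel = token_relations(student_q, student_k)
loss = relation_distillation_loss(teacher_rel, student_rel)
print(f"relation distillation loss: {loss:.4f}")
```

Because each relation matrix has one row and one column per token, the teacher and student may use different feature widths without any extra projection layer.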
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
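The token-based distillation described above can be sketched as a loss with two heads. This is a minimal illustration and not the authors' code: it assumes a hard-label variant in which the class-token head is trained on the ground-truth label, the distillation-token head is trained on the teacher's argmax prediction, and the two terms are weighted equally. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the target class.
    return -math.log(softmax(logits)[target])

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    # The teacher's most likely class acts as a second "hard" label.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    class_term = cross_entropy(cls_logits, label)            # class token vs. true label
    distill_term = cross_entropy(dist_logits, teacher_label)  # distillation token vs. teacher
    return 0.5 * class_term + 0.5 * distill_term             # equal weights are an assumption

cls_logits = [2.0, 0.0, 0.0]      # student head on the class token
dist_logits = [0.0, 2.0, 0.0]     # student head on the distillation token
teacher_logits = [0.1, 3.0, 0.2]  # e.g. a convnet teacher
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label=0)
print(f"distillation loss: {loss:.4f}")
```

In this toy case the teacher disagrees with the ground-truth label, so the two heads are pulled toward different classes, which is what lets the distillation token carry information complementary to the class token.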
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers trained on ImageNet only, using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.
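Large pre-trained transformers top contemporary semantic segmentation benchmarks, but come with high computational cost and lengthy training. To lift this constraint, we look at efficient semantic segmentation from the perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extraction and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework, which learns compact student transformers by distilling both the feature maps and the patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing the FLOPs by >85.0%. Specifically, we propose two fundamental and two optimization modules: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate patch embedding distillation; (3) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (4) Embedding Assistant (EA) acts as an embedding method to seamlessly bridge teacher and student models with the teacher's number of channels. Experiments on the Cityscapes, ACDC, and NYUv2 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. Code is available at https://github.com/ruipingl/transkd.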
We propose a new neural network design paradigm, Reversible Column Network (RevCol). The main body of RevCol is composed of multiple copies of a subnetwork, called columns, between which multi-level reversible connections are employed. This architectural scheme gives RevCol very different behavior from conventional networks: during forward propagation, features in RevCol are learned to be gradually disentangled when passing through each column, and their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCol models can achieve very competitive performance on multiple computer vision tasks such as image classification, object detection, and semantic segmentation, especially with a large parameter budget and a large dataset. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 63.8% APbox on the COCO detection minival set, and 61.0% mIoU on ADE20k segmentation. To our knowledge, these are the best COCO detection and ADE20k segmentation results among pure (static) CNN models. Moreover, as a general macro architecture, RevCol can also be introduced into transformers or other neural networks, which is demonstrated to improve performance in both computer vision and NLP tasks. We release code and models at https://github.com/megvii-research/RevCol.
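The vision transformer (ViT) is the dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, the problem of finding better transformer networks remains. For example, conventional transformer-based models usually use one projection layer for each of the query (Q), key (K), and value (V) embeddings before multi-head self-attention. Insufficient consideration of the semantic $Q$, $K$, and $V$ embeddings may lead to a performance drop. In this paper, we propose three types of structures for the $Q$, $K$, and $V$ embeddings. The first structure uses two layers with ReLU, which gives a non-linear embedding for $Q$, $K$, and $V$. The second involves sharing one of the non-linear layers to share knowledge among $Q$, $K$, and $V$. The third structure shares all non-linear layers together with code parameters. The codes are trainable, and their values determine the embedding process to be performed among $Q$, $K$, and $V$. We demonstrate the superior image classification performance of the proposed approaches in experiments compared to several state-of-the-art approaches. The proposed method achieves $71.4\%$ with few parameters ($3.1M$) on the ImageNet-1k dataset, compared to the original XCiT-N12 transformer model ($69.9\%$). Additionally, the method achieves $93.3\%$ on average with only $2.9M$ parameters in transfer learning on the CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 datasets, which is better than the accuracy of $92.2\%$ obtained with the original XCiT-N12 model.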
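The first two embedding structures can be sketched with plain matrix-vector products. This is a minimal illustration under stated assumptions rather than the paper's implementation: biases are omitted, the shared layer in the second structure is assumed to be the first of the two layers, and all weights and helper names are hypothetical.

```python
def linear(weights, x):
    # Matrix-vector product; each row of `weights` produces one output unit.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def two_layer_embed(first, second, x):
    # Non-linear embedding: linear -> ReLU -> linear.
    return linear(second, relu(linear(first, x)))

token = [1.0, -2.0]

# Structure 1 (sketch): Q, K and V each own both layers.
separate = {
    "q": ([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]),
    "k": ([[0.0, -1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "v": ([[1.0, 1.0], [1.0, -1.0]], [[1.0, 0.0], [0.0, 1.0]]),
}
structure1 = {name: two_layer_embed(w1, w2, token) for name, (w1, w2) in separate.items()}

# Structure 2 (sketch): one first layer is shared, the second layers stay separate.
shared_first = [[1.0, 0.0], [0.0, 1.0]]
second_layers = {
    "q": [[2.0, 0.0], [0.0, 2.0]],
    "k": [[0.0, 1.0], [1.0, 0.0]],
    "v": [[1.0, 1.0], [1.0, 1.0]],
}
structure2 = {name: two_layer_embed(shared_first, w2, token) for name, w2 in second_layers.items()}

print(structure1)
print(structure2)
```

Ignoring biases, a single projection per embedding costs 3d² weights for width d, the first structure 6d², and the shared variant 4d², which is one way to read the trade-off between capacity and the tiny-model budget.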
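Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only canonical transformers but also high-performance convolutional models. We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMTs, which obtain better accuracy and efficiency than previous convolution-based and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while requiring 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well on CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with less computational cost.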
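The multilayer perceptron (MLP), as the first neural network structure to appear, was a big hit. But constrained by hardware computing power and the size of datasets, it then lay dormant for decades. During this period, we witnessed a paradigm shift from manual feature extraction to CNNs with local receptive fields, and further to Transformers with global receptive fields based on the self-attention mechanism. This year (2021), with the introduction of MLP-Mixer, the MLP has re-entered the limelight and attracted extensive research from the computer vision community. Compared to the conventional MLP, it gets deeper but changes the input from full flattening to patch flattening. Given its high performance and lower need for vision-specific inductive bias, the community cannot help but wonder: will the MLP, the simplest structure with global receptive fields but no attention, become a new computer vision paradigm? To answer this question, this survey aims to provide a comprehensive overview of the recent development of vision deep MLP models. Specifically, we review these vision deep MLPs in detail, from subtle sub-module design to global network structure. We compare the receptive field, computational complexity, and other properties of different network designs in order to gain a clear understanding of the development path of MLPs. The investigation shows that the resolution sensitivity and computational density of MLPs remain unresolved, and that pure MLPs are gradually evolving to become CNN-like. We suggest that current data volume and computational power are not ready to embrace pure MLPs, and that artificial visual guidance remains important. Finally, we provide an analysis of open research directions and possible future work. We hope this effort will ignite further interest in the community and encourage better vision-tailored designs for neural networks at the moment.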
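The difference between full flattening and patch flattening can be shown on a toy image. The sketch below is illustrative only; it assumes the image is a nested list of shape height x width x channels and that patches are square and non-overlapping. The function names are hypothetical.

```python
def full_flatten(image):
    # Conventional MLP input: the whole image becomes one long vector.
    return [value for row in image for pixel in row for value in pixel]

def patch_flatten(image, patch):
    # Patch-based MLP input: one flattened vector (token) per non-overlapping patch.
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                value
                for r in range(top, top + patch)
                for c in range(left, left + patch)
                for value in image[r][c]
            ])
    return tokens

# A 4 x 4 single-channel image whose pixel value encodes its position.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]

vector = full_flatten(image)
tokens = patch_flatten(image, patch=2)
print(len(vector))               # 16
print(len(tokens), tokens[0])    # 4 [0, 1, 4, 5]
```

With patch flattening, the per-token layers see only a patch-sized vector, while mixing across the token axis is what provides the global receptive field.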
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
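This paper tackles the low-efficiency flaw of vision transformers caused by the high computational/space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, we first divide the input image into patches as is commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only compute attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/hat-net.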
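The saving from computing attention over a limited number of tokens at each step can be checked with a back-of-the-envelope count. The sketch below is not taken from the HAT-Net code: it only counts query-key pairs, and it assumes square local windows, a square merge factor, and a global step that attends among merged tokens only. The grid size, window size, and merge factor are hypothetical.

```python
def attention_pairs(num_tokens):
    # Full self-attention scores every token against every token.
    return num_tokens * num_tokens

def hierarchical_pairs(height, width, window, merge):
    # Local step: full attention inside each window of `window` x `window` tokens.
    assert height % window == 0 and width % window == 0
    num_windows = (height // window) * (width // window)
    local = num_windows * attention_pairs(window * window)
    # Global step: attention among tokens merged by a factor of `merge` per side.
    assert height % merge == 0 and width % merge == 0
    merged_tokens = (height // merge) * (width // merge)
    return local + attention_pairs(merged_tokens)

full = attention_pairs(56 * 56)                          # one global attention over a 56 x 56 grid
hier = hierarchical_pairs(56, 56, window=8, merge=8)     # local windows plus merged global step
print(full, hier, round(full / hier, 1))
```

For this hypothetical 56 x 56 token grid, the hierarchical count is roughly 48 times smaller than the single global attention, which illustrates why the load drops even though both local and global relations are modeled.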
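Transformer models are not only successful in natural language processing (NLP) but also show high potential in computer vision (CV). Despite great advances, most works focus only on improving architectures and pay little attention to the classification head. For years, transformer models have relied exclusively on the classification token to construct the final classifier, without explicitly harnessing high-level word tokens. In this paper, we propose a novel transformer model called second-order transformer (SoT), which exploits both the classification token and the word tokens for the classifier. Specifically, we empirically show that high-level word tokens contain rich information, which by itself is very competitive for the classifier and, moreover, is complementary to the classification token. To effectively harness this rich information, we propose multi-headed global cross-covariance pooling with singular value power normalization, which shares a similar philosophy with the transformer block and is therefore compatible with it, and which performs better than commonly used pooling methods. Then, we comprehensively study how to explicitly combine word tokens with the classification token to build the final classification head. For CV tasks, our SoT significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning based on pretrained language transformers, our SoT greatly boosts performance on widely used tasks such as CoLA and RTE. Code will be available at https://peihuali.org/sot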
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
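While the Vision Transformer (VT) architecture is becoming increasingly popular in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes locality guidance for improving the performance of VTs on tiny datasets. We first analyze that local information, which is of great importance for understanding images, is hard to learn with limited data due to the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate local information, we realize locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNNs. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is sufficient to accelerate convergence and improve the performance of VTs to a large extent. Our locality guidance approach is therefore very simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method can significantly improve VTs when training from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our proposed method can boost the performance of various VTs on tiny datasets (e.g., 13.07% for DeiT, 8.98% for T2T, and 7.85% for PVT), and improves the even stronger baseline PVTv2 by 1.86% to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.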
An extreme performance gap still remains between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors, yet the lack of data hinders ViTs from attending to spatial relevance. Second, on the channel aspect, representations exhibit diversity across channels, but scarce data cannot enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution to enhance the two inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and the multi-layer perceptron module, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand new "head token" design in the multi-head self-attention module to help re-calibrate channel representations and make the representations of different channel groups interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art results with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can benefit from increasing parameters and training data in the way ViTs do. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has adaptive spatial aggregation conditioned on input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data, like ViTs. The effectiveness of our model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record of 65.4 mAP on COCO test-dev. The code will be released at https://github.com/OpenGVLab/InternImage.
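We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across the depth of transformer networks. The proposed method can obtain a substantial gain (~2%) simply by using a naive recursive operation, requires no special or sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining superior accuracy, we propose an approximation method based on multiple sliced group self-attentions across recursive layers, which can reduce the cost by 10~30% with minimal performance loss. We call our model Sliced Recursive Transformer (SReT), which is compatible with a broad range of other designs for efficient vision transformers. Our best model establishes a significant improvement on ImageNet over state-of-the-art methods while containing fewer parameters. The proposed sliced recursive operation allows us to build a transformer with more than 100 or even 1000 layers while still keeping a small size (13~15M), which avoids difficulties in optimization when the model size is too large. This flexible scalability shows great potential for scaling up and constructing extremely deep and large-dimensionality vision transformers. Our code and models are available at https://github.com/szq0214/sret.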
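The naive recursive operation, that is, applying the same weights more than once along the depth, can be sketched with a toy residual block. This is an illustration only and not the SReT code: the block is a stand-in (a residual ReLU layer instead of a transformer block), the sliced group self-attention is not modeled, and all names and values are hypothetical.

```python
def make_block(dim, scale):
    # Toy block weights: a scaled identity matrix, so the arithmetic is easy to follow.
    return [[scale if r == c else 0.0 for c in range(dim)] for r in range(dim)]

def apply_block(weights, x):
    # Residual update: x + ReLU(W x).
    projected = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [v + max(0.0, p) for v, p in zip(x, projected)]

def forward(blocks, x, recursions):
    # Each block is applied `recursions` times in a row with the SAME weights.
    for weights in blocks:
        for _ in range(recursions):
            x = apply_block(weights, x)
    return x

def count_params(blocks):
    return sum(len(weights) * len(weights[0]) for weights in blocks)

blocks = [make_block(2, 0.5), make_block(2, 1.0)]
plain = forward(blocks, [1.0, 2.0], recursions=1)      # effective depth 2
recursive = forward(blocks, [1.0, 2.0], recursions=2)  # effective depth 4
print(plain, recursive, count_params(blocks))
```

The recursive forward pass is twice as deep and costs twice the computation, yet `count_params` is unchanged, which is the sense in which depth grows without involving additional parameters.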
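Visual Transformers (VTs) are emerging as an architectural paradigm alternative to convolutional networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements, and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties that are embedded in the CNN architectural design must be learned from samples in VTs. In this paper, we empirically analyze different VTs, comparing their robustness in a small training-set regime, and we show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ greatly. Moreover, we propose a self-supervised task that can extract additional information from images with only negligible computational overhead. This task encourages VTs to learn spatial relations within an image and makes VT training much more robust when training data is scarce. Our task is used jointly with standard (supervised) training and does not depend on specific architectural choices, so it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of VTs. Our code is available at: https://github.com/yhlleo/vts-droc.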
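The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and have demonstrated their effectiveness on various CV tasks. Visual Transformers rely on a competitive modeling capability compared with modern convolutional neural networks. In this paper, we provide a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and target tasks, we also evaluate these methods on different configurations for easy and intuitive comparison, rather than only on various benchmarks. Furthermore, we reveal a series of essential but unexploited aspects that may enable the Transformer to stand out from numerous architectures, such as slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.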