Smart Paper Notes

ResFormer: Scaling ViTs with Multi-Resolution Training

Rui Tian , Zuxuan Wu , Qi Dai , Han Hu , Yu Qiao , Yu-Gang Jiang

Category: Computer Vision

2022-12-01

Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., the performance drops drastically when presented with input resolutions that are unseen during training. We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of mostly unseen testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. This allows ResFormer to cope with novel resolutions effectively. We conduct extensive experiments for image classification on ImageNet. The results provide strong quantitative evidence that ResFormer has promising scaling abilities towards a wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1 accuracy of 75.86% and 81.72% when evaluated on relatively low and high resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B. We also demonstrate, among other things, that ResFormer is flexible and can be easily extended to semantic segmentation and video action recognition.

Translated by Google Translate

Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li , Chao-Yuan Wu , Haoqi Fan , Karttikeya Mangalam , Bo Xiong , Jitendra Malik , Christoph Feichtenhofer

Category: Computer Vision

2021-12-02

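In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection, and Kinetics video recognition, where it outperforms prior work. We further compare MViT's pooling attention to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without bells and whistles, MViT has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.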

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Xiaoyi Dong , Jianmin Bao , Dongdong Chen , Weiming Zhang , Nenghai Yu , Lu Yuan , Dong Chen , Baining Guo

Category: Computer Vision | Machine Learning

2021-07-01

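We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel that together form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions and is thus especially effective and friendly for downstream tasks. Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on detection, and 52.2 mIoU on the ADE20K semantic segmentation task, surpassing the previous state of the art by +1.2, +2.0, +1.4, and +2.0 respectively under similar FLOPs settings. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. Code and models are available at https://github.com/microsoft/cswin-transformer.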
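
The cross-shaped window described above can be illustrated with a small sketch. It is not the authors' implementation: it only computes which positions one token can see, assuming the feature map is cut into horizontal and vertical stripes of equal width and the token attends within the one stripe of each orientation that contains it. The function names and the 8x8 example are hypothetical.

```python
def cross_shaped_window(row, col, height, width, stripe_width):
    """Return the set of (row, col) positions visible to one token.

    The feature map is split into horizontal stripes (bands of rows) and
    vertical stripes (bands of columns), each `stripe_width` wide. The token
    sees every position in the horizontal stripe and in the vertical stripe
    that contain it; the union of the two forms a cross.
    """
    if height % stripe_width or width % stripe_width:
        raise ValueError("stripe_width must divide both height and width")
    row_start = (row // stripe_width) * stripe_width
    col_start = (col // stripe_width) * stripe_width
    horizontal = {(r, c)
                  for r in range(row_start, row_start + stripe_width)
                  for c in range(width)}
    vertical = {(r, c)
                for r in range(height)
                for c in range(col_start, col_start + stripe_width)}
    return horizontal | vertical


def render(visible, height, width):
    """Draw the visible region as text: '#' is visible, '.' is not."""
    return "\n".join(
        "".join("#" if (r, c) in visible else "." for c in range(width))
        for r in range(height)
    )


if __name__ == "__main__":
    # Hypothetical 8x8 feature map with stripes two tokens wide.
    region = cross_shaped_window(row=3, col=5, height=8, width=8, stripe_width=2)
    print(render(region, 8, 8))
    print(len(region), "of", 8 * 8, "positions are visible")
```

Widening the stripes enlarges the cross until, at a stripe width equal to the map size, it covers the whole feature map. That is the trade-off between modeling range and cost that the abstract attributes to the stripe width.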

ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer

Rui Yang , Hailong Ma , Jie Wu , Yansong Tang , Xuefeng Xiao , Min Zheng , Xiu Li

Category: Computer Vision | Artificial Intelligence

2022-03-21

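The vanilla self-attention mechanism inherently relies on pre-defined and fixed computational dimensions. This inflexibility prevents it from achieving context-oriented generalization, which can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release the dimensions of the query, key, and value matrices while unbinding them from the input. This scalability yields context-oriented generalization and enhances object sensitivity, pushing the whole network toward a more effective trade-off between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance on general-purpose vision tasks. For example, on ImageNet-1K classification, ScalableViT-S outperforms both Twins-SVT-S and Swin-T (the stated margin is 1.4%).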

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu , Yutong Lin , Yue Cao , Han Hu , Yixuan Wei , Zheng Zhang , Stephen Lin , Baining Guo

Category:

2021-03-25

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state of the art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
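
The claim of linear complexity with respect to image size can be checked with a rough count. The sketch below is a simplification, not the paper's cost formula: it counts only query-key pairs, ignores the projections and the window shifting, and uses a window size of 7 purely as an illustrative assumption. The function names are hypothetical.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention stays inside window x window tiles."""
    if height % window or width % window:
        raise ValueError("window must divide both height and width")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


if __name__ == "__main__":
    window = 7  # assumed window size, chosen for illustration
    for side in (56, 112, 224):
        g = global_attention_pairs(side, side)
        w = window_attention_pairs(side, side, window)
        print(f"{side}x{side} tokens: global={g:,} windowed={w:,}")
```

Doubling the side length multiplies the token count by 4. The windowed count also grows by 4 (linear in the number of tokens), while the global count grows by 16 (quadratic).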

MPViT: Multi-Path Vision Transformer for Dense Prediction

Youngwan Lee , Jonghee Kim , Jeff Willette , Sung Ju Hwang

Category: Computer Vision

2021-12-21

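Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions of varying sizes. While Convolutional Neural Networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as backbones. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, taking a different perspective from existing Transformers, we explore multi-scale patch embedding and a multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths, and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse, multi-scale feature representations, our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks. Code will be made publicly available at https://git.io/mpvit.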

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang , Enze Xie , Xiang Li , Deng-Ping Fan , Kaitao Song , Ding Liang , Tong Lu , Ping Luo , Ling Shao

Category:

2021-02-24

ous vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance segmentation, and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.

RegionViT: Regional-to-Local Attention for Vision Transformers

Chun-Fu Chen , Rameswar Panda , Quanfu Fan

Category: Computer Vision

2021-06-04

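Vision Transformer (ViT) has recently shown a strong capability to achieve results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits the same architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, we propose a new architecture that adopts a pyramid structure and employs a novel regional-to-local attention, rather than global self-attention, in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on spatial location. The regional-to-local attention consists of two steps: first, the regional self-attention extracts global information among all regional tokens; then, the local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on four vision tasks, including image classification, object and keypoint detection, semantic segmentation, and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source code and models are available at https://github.com/ibm/regionvit.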

Shunted Self-Attention via Multi-Scale Token Aggregation

Sucheng Ren , Daquan Zhou , Shengfeng He , Jiashi Feng , Xinchao Wang

Category: Computer Vision

2021-11-30

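Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually assign similar receptive fields to every token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, leading to performance degradation on images with multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model attention at hybrid scales within each attention layer. The key idea of SSA is to inject heterogeneous receptive field sizes into tokens: before computing the self-attention matrix, it selectively merges tokens to represent larger object features while keeping certain tokens to preserve fine-grained features. This novel merging scheme enables self-attention to learn relationships between objects of different sizes and simultaneously reduces the number of tokens and the computational cost. Extensive experiments across various tasks demonstrate the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet with only half the model size and computation cost, and surpasses Focal Transformer by 1.3 mAP on COCO and 2.9 mIoU on ADE20K under similar parameter and computation cost. Code has been released at https://github.com/oliverrensu/shunted-transformer.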

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Xiangxiang Chu , Zhi Tian , Yuqing Wang , Bo Zhang , Haibing Ren , Xiaolin Wei , Huaxia Xia , Chunhua Shen

Category:

2021-04-28

Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favorably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is available at https://git.io/Twins.

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

Yongming Rao , Zuyan Liu , Wenliang Zhao , Jie Zhou , Jiwen Lu

Category: Computer Vision | Artificial Intelligence | Machine Learning

2022-07-04

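In this paper, we present a new approach to model acceleration that exploits spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically based on the input in order to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and more expressive slow paths to more important locations, we can maintain the structure of feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/dynamicvit.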
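
The progressive pruning idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the released code: the importance scores are passed in as plain numbers rather than produced by a learned prediction module, the same scores are reused at every stage instead of being re-estimated from current features, and the names `prune_tokens`, `hierarchical_prune`, and `keep_ratio` are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    if len(tokens) != len(scores):
        raise ValueError("need one score per token")
    keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the `keep` largest scores, restored to original order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [tokens[i] for i in kept], [scores[i] for i in kept]


def hierarchical_prune(tokens, scores, keep_ratio, num_stages):
    """Prune at several stages in a row, recording the survivors each time.

    Simplification: scores are fixed here. The abstract describes a
    prediction module that estimates them from the current features.
    """
    history = []
    for _ in range(num_stages):
        tokens, scores = prune_tokens(tokens, scores, keep_ratio)
        history.append(list(tokens))
    return history


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(8)]
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]  # made-up importances
    for stage, kept in enumerate(hierarchical_prune(tokens, scores, 0.5, 3), 1):
        print(f"stage {stage}: {kept}")
```

Because each stage shortens the sequence that later layers must process, the saving compounds with depth, which is the motivation the abstract gives for pruning hierarchically rather than once.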

GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

Chenhongyi Yang , Jiarui Xu , Shalini De Mello , Elliot J. Crowley , Xiaolong Wang

Category: Computer Vision | Machine Learning

2022-12-13

We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs. For example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT.

CMT: Convolutional Neural Networks Meet Vision Transformers

Jianyuan Guo , Kai Han , Han Wu , Yehui Tang , Xinghao Chen , Yunhe Wang , Chang Xu

Category: Computer Vision

2021-07-13

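Vision transformers have been successfully applied to image recognition tasks thanks to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only canonical transformers but also high-performance convolutional models. We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMTs, which achieve better accuracy and efficiency than previous convolution-based and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet while requiring 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well to CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with less computational cost.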

Vision Transformers with Hierarchical Attention

Yun Liu , Yu-Huan Wu , Guolei Sun , Le Zhang , Ajad Chhatkuli , Luc Van Gool

Category: Computer Vision

2021-06-06

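This paper tackles the low-efficiency flaw of vision transformers caused by the high computational/space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, we first divide the input image into patches, as is commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we compute attention for only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/hat-net.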

Vision Transformer with Deformable Attention

Zhuofan Xia , Xuran Pan , Shiji Song , Li Erran Li , Gao Huang

Category: Computer Vision

2022-01-03

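Transformers have recently shown superior performance on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention, e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts beyond the region of interest. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data-agnostic and may limit the ability to model long-range relations. To mitigate these issues, we propose a novel deformable self-attention module, in which the positions of key and value pairs in self-attention are selected in a data-dependent way. This flexible scheme enables the self-attention module to focus on relevant regions and capture more informative features. On this basis, we present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks. Extensive experiments show that our models achieve consistently improved results on comprehensive benchmarks. Code is available at https://github.com/leaplabthu/dat.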

Training data-efficient image transformers & distillation through attention

Hugo Touvron , Matthieu Cord , Matthijs Douze , Francisco Massa , Alexandre Sablayrolles , Hervé Jégou

Category:

2020-12-23

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.

P2T: Pyramid Pooling Transformer for Scene Understanding

Yu-Huan Wu , Yun Liu , Xin Zhan , Ming-Ming Cheng

Category: Computer Vision

2021-06-22

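Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Plugging in our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is used as the backbone network, it shows substantial superiority in various vision tasks compared to previous CNN-based and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/p2t.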
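
How pyramid pooling shortens the sequence can be sketched with plain lists. The sketch makes several assumptions that are not taken from the abstract: the pooled tokens are treated as the shortened key/value sequence while queries keep full length, average pooling over non-overlapping tiles is used, and the ratios (2, 4, 8) and the 16x16 map are hypothetical values chosen for illustration.

```python
def average_pool(grid, ratio):
    """Average-pool a 2D list of numbers over non-overlapping ratio x ratio tiles."""
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("ratio must divide both height and width")
    pooled = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            tile = [grid[r][c]
                    for r in range(top, top + ratio)
                    for c in range(left, left + ratio)]
            row.append(sum(tile) / len(tile))
        pooled.append(row)
    return pooled


def pyramid_pooled_sequence(grid, ratios):
    """Pool at every ratio, flatten each result, and concatenate them."""
    sequence = []
    for ratio in ratios:
        for row in average_pool(grid, ratio):
            sequence.extend(row)
    return sequence


if __name__ == "__main__":
    side = 16
    grid = [[float(r * side + c) for c in range(side)] for r in range(side)]
    ratios = (2, 4, 8)  # hypothetical pooling ratios
    pooled = pyramid_pooled_sequence(grid, ratios)
    print("query tokens:", side * side)
    print("pooled key/value tokens:", len(pooled))
    print("attention pairs:", side * side * len(pooled), "vs", (side * side) ** 2)
```

Compared with a single pooling operation, the concatenated sequence mixes coarse and fine summaries of the same map, yet it is still much shorter than the full token sequence.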

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

Jiemin Fang , Lingxi Xie , Xinggang Wang , Xiaopeng Zhang , Wenyu Liu , Qi Tian

Category: Computer Vision | Machine Learning

2021-05-31

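Transformers have offered a new methodology for designing neural networks for visual recognition. Compared to convolutional networks, Transformers enjoy the ability to refer to global features at each stage, yet the attention module brings higher computational overhead, which obstructs the application of Transformers to high-resolution visual data. This paper aims to alleviate the conflict between efficiency and flexibility, for which we propose a specialized token for each region that serves as a messenger (MSG). By manipulating these MSG tokens, one can flexibly exchange visual information across regions while reducing the computational complexity. We then integrate the MSG token into a multi-scale architecture named MSG-Transformer. In standard image classification and object detection, MSG-Transformer achieves competitive performance and accelerates inference on both GPU and CPU. Code is available at https://github.com/hustvl/msg-transformer.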

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Wuyang Chen , Xianzhi Du , Fan Yang , Lucas Beyer , Xiaohua Zhai , Tsung-Yi Lin , Huizhong Chen , Jing Li , Xiaodan Song , Zhangyang Wang

Category: Computer Vision

2021-12-17

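This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers have recently demonstrated competitive performance in image classification tasks. To adapt ViT to object detection and dense prediction tasks, many works inherit the multi-stage design from convolutional networks and use highly customized ViT architectures. The goal behind this design is to pursue a better trade-off between computational cost and effective aggregation of multi-scale global contexts. However, existing works adopt the multi-stage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT (spatial reduction, doubled channels, and multi-scale features) and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafted multi-scale features, maintaining the original ViT design philosophy. We further complete a scaling rule to optimize our model's trade-off between accuracy and computation cost/model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.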

CvT: Introducing Convolutions to Vision Transformers

Haiping Wu , Bin Xiao , Noel Codella , Mengchen Liu , Xiyang Dai , Lu Yuan , Lei Zhang

Category:

2021-03-29

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
