Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this efficient augmentation strategy has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image, as well as introducing instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets such as SLIP and MaskCLIP, our method is not only more effective, but also much more efficient. Specifically, using ViT-B and YFCC-15M dataset, our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are $+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being $2.30\times$ faster. An efficient version of our approach running $1.16\times$ faster than the plain CLIP model achieves significant gains of $+5.3\%$, $+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks.
translated by 谷歌翻译
在本文中,我们对检测变压器(DETR)感兴趣,这是一种基于变压器编码器编码器架构的端到端对象检测方法,而无需手工制作的后处理,例如NMS。受到有条件的Detr的启发,这是一种具有快速训练收敛性的改进的DETR,对内部解码器层提出了盒子查询(最初称为空间查询),我们将对象查询重新将对象查询重新布置为盒子查询的格式,该格式是参考参考嵌入的组成点和框相对于参考点的转换。该重新制定表明在更快地使用R-CNN中广泛研究的DETR中的对象查询与锚固框之间的联系。此外,我们从图像内容中学习了盒子查询,从而进一步提高了通过快速训练收敛的有条件DETR的检测质量。此外,我们采用轴向自我注意的想法来节省内存成本并加速编码器。所得的检测器(称为条件DETR V2)取得比条件DETR更好的结果,可节省内存成本并更有效地运行。例如,对于DC $ 5 $ -Resnet- $ 50 $骨干,我们的方法在可可$ Val $ set上获得了$ 44.8 $ ap,$ 16.4 $ fps和有条件的detr相比,它运行了$ 1.6 \ tims $ $ $ $ $,节省$ 74 $ \ \ \ \ \ \ \ \ \ \ \ \ \ $ 74美元总体内存成本的百分比,并提高$ 1.0 $ ap得分。
translated by 谷歌翻译
零击学习(ZSL)旨在识别培训集中没有样本的类。一种代表性的解决方案是直接学习将视觉特征与相应的类语义相关联的嵌入函数,以识别新类。许多方法扩展了这种解决方案,最近的方法特别热衷于从图像中提取丰富的特征,例如属性功能。这些属性特征通常在每个单独的图像中提取;但是,不强调跨图像的特征的共同特征。在本文中,我们提出了一个新的框架来通过明确学习原型超出图像来提高ZSL,并用图像中的属性级特征对其进行对比优化它们。除了新颖的体系结构外,还针对属性表示强调了两个元素:新的原型生成模块旨在从属性语义生成属性原型;引入了基于硬示例的对比优化方案,以增强嵌入空间中的属性级特征。我们探索了两个基于CNN的替代骨干,基于CNN的骨干,以在三个标准基准测试(Cub,Sun,Awa2)上构建我们的框架并进行实验。这些基准测试的结果表明,我们的方法通过相当大的利润来改善艺术的状态。我们的代码将在https://github.com/dyabel/coar-zsl.git上找到
translated by 谷歌翻译
诸如剪辑之类的对比视觉模型在转移学习方面已显示出巨大进展。在推理阶段,需要仔细设计适当的文本描述,也称为提示,以正确地对给定的图像进行分类。为了避免繁琐的及时工程,最近的作品,例如Coop,Clip-Audapter和Tip-Adapter,建议将视觉模型改编成下游图像识别任务,以在一小部分标记的数据上。尽管实现了有希望的改进,但是需要来自目标数据集的标记数据可能会限制可扩展性。在本文中,我们探讨了一种不同的情况,在该场景中,目标数据集的标签未经证实,并提出了一种无监督的及时学习方法(UPL)方法,以避免及时工程,同时改善类似夹子的视觉模型的传递性能。据我们所知,UPL是第一项将无监督学习引入及时学习的工作。在实验上,我们的UPL在ImageNet以及其他10个数据集上及时使用及时的工程剪辑优于原始剪辑。增强版本的UPL甚至与大多数数据集的8-Shot Coop和8-Shot Tip-Adapter都具有竞争力。代码和型号可在https://github.com/tonyhuang2022/upl上找到。
translated by 谷歌翻译
最近,Vision-Language预训练的零拍图像分类已经表现出令人难以置信的成就,即该模型可以对任意类别进行分类而不看到该类别的其他注释图像。然而,目前尚不清楚如何在更广泛的视觉问题上进行零射识别,例如对象检测和语义分割。在本文中,我们通过在现成的预训练的视觉模型,即剪辑上建立零拍语义分割来定位零拍语义分割。很难因为语义分割和剪辑模型在不同的视觉粒度上执行,该语义分段处理在像素上时,而剪辑在图像上执行。为了解决处理粒度的差异,我们拒绝使用普遍的一级FCN基于FCN的框架,并倡导一个两级语义分割框架,其中第一阶段提取一个完全提取的掩模提案和第二阶段利用基于图像的剪辑模型在第一阶段生成的蒙版图像作物上执行零拍分类。我们的实验结果表明,这种简单的框架通过大型利润率超越了先前的最先进:+29.5 Hiou On Pascal VOC 2012 DataSet,+8.9 Hiou On Coco Stuff DataSet。凭借其简单性和强大的表现,我们希望本框架成为促进未来研究的基准。
translated by 谷歌翻译
由于数据注释的高成本,半监督行动识别是一个具有挑战性的,但重要的任务是。这个问题的常见方法是用伪标签分配未标记的数据,然后将其作为训练中的额外监督。通常在最近的工作中,通过在标记数据上训练模型来获得伪标签,然后使用模型的自信预测来教授自己。在这项工作中,我们提出了一种更有效的伪标签方案,称为跨模型伪标记(CMPL)。具体地,除了主要骨干内,我们还介绍轻量级辅助网络,并要求他们互相预测伪标签。我们观察到,由于其不同的结构偏差,这两种模型倾向于学习来自同一视频剪辑的互补表示。因此,通过利用跨模型预测作为监督,每个模型都可以受益于其对应物。对不同数据分区协议的实验表明我们对现有替代方案框架的重大改进。例如,CMPL在Kinetics-400和UCF-101上实现了17.6 \%$ 17.6 \%$ 25.1 \%$ 25.使用RGB模态和1 \%$标签数据,优于我们的基线模型,FIXMATCT,以$ 9.0 \% $和10.3美元\%$。
translated by 谷歌翻译
对于人类的行动理解,流行的研究方向是分析具有明确的语义含量的短视频剪辑,例如跳跃和饮酒。然而,了解短语行动的方法不能直接翻译成长期以来的人类动态,如跳舞,即使在语义上也是挑战的挑战。同时,自然语言处理(NLP)社区通过大规模预培训解决了稀缺的类似挑战,这改善了一种模型的几个下游任务。在这项工作中,我们研究如何以自我监督的方式进行分段和群集视频,即Acton Discovery,朝向视频标记的主要障碍。我们提出了一种两级框架,首先通过对应于它们的时间上下文的视频帧的两个增强视图对比其次的视频帧的两个增强视图来获得帧智表示。然后通过k-means群集视频集集中的帧展表示。然后通过从同一簇内的帧形成连续的运动序列来自动提取actons。通过标准化的相互信息和语言熵,我们通过Kendall的Tau和Lexicon构建步骤进行评估框架明智的表现。我们还研究了这个标记化的三种应用:类型分类,行动细分和行动组成。在AIST ++和PKU-MMD数据集上,与几个基线相比,Actons带来了显着的性能改进。
translated by 谷歌翻译
我们介绍混音,一个用于对象检测的新培训范例,可以免费提高现有探测器的性能。混合通过利用不同优点的增强来增强数据增强,同时排除某些可能对培训可能有害的培训样本的强大增强。此外,它通过结合可以补偿这些错误的伪框来解决人类注释中的本地化噪声和丢失标签。通过对探测器的自动启动,可以使用这些混音功能,这可以用于预测对强大增强的训练难度,以及由于神经网络对标记错误的鲁棒性而产生可靠的伪框。发现混音是在Coco DataSet上的各种探测器上带来一致的改进。特别是,使用Reset-50 \ Cite {REN2015Faster}更快的R-CNN \ CITE {REN2015FAST}骨架的性能从41.7地图改进到44.0地图,以及CASCADE-RCNN \ CITE {CAI2018CASCADE}的准确性-small \ cite {liu2021swin}骨干从50.9地图提出到52.8地图。代码和模型将在\ url {https://github.com/mendelxu/mixtraining}上公开可用。
translated by 谷歌翻译
The recent progress of CNN has dramatically improved face alignment performance. However, few works have paid attention to the error-bias with respect to error distribution of facial landmarks. In this paper, we investigate the error-bias issue in face alignment, where the distributions of landmark errors tend to spread along the tangent line to landmark curves. This error-bias is not trivial since it is closely connected to the ambiguous landmark labeling task. Inspired by this observation, we seek a way to leverage the error-bias property for better convergence of CNN model. To this end, we propose anisotropic direction loss (ADL) and anisotropic attention module (AAM) for coordinate and heatmap regression, respectively. ADL imposes strong binding force in normal direction for each landmark point on facial boundaries. On the other hand, AAM is an attention module which can get anisotropic attention mask focusing on the region of point and its local edge connected by adjacent points, it has a stronger response in tangent than in normal, which means relaxed constraints in the tangent. These two methods work in a complementary manner to learn both facial structures and texture details. Finally, we integrate them into an optimized end-to-end training pipeline named ADNet. Our ADNet achieves state-of-the-art results on 300W, WFLW and COFW datasets, which demonstrates the effectiveness and robustness.
translated by 谷歌翻译