In person re-identification (ReID) tasks, many works explore the learning of part features to improve the performance over global image features. Existing methods extract part features in an explicit manner, by either using a hand-designed image division or keypoints obtained with external visual systems. In this work, we propose to learn Discriminative implicit Parts (DiPs) which are decoupled from explicit body parts. Therefore, DiPs can learn to extract any discriminative features that can benefit in distinguishing identities, which is beyond predefined body parts (such as accessories). Moreover, we propose a novel implicit position to give a geometric interpretation for each DiP. The implicit position can also serve as a learning signal to encourage DiPs to be more position-equivariant with the identity in the image. Lastly, a set of attributes and auxiliary losses are introduced to further improve the learning of DiPs. Extensive experiments show that the proposed method achieves state-of-the-art performance on multiple person ReID benchmarks.
translated by 谷歌翻译
在本文中,我们考虑了通用视觉对象计数的问题,其目的是开发一种计算模型,用于使用任意数量的“示例”,即零射击或几次计数来计算任意语义类别的对象数量。为此,我们做出以下四个贡献:(1)我们引入了一种基于变压器的新型架构,用于广义视觉对象计数,称为计数变压器(乡村),该架构明确捕获图像贴片或给定的“示例”之间的相似性,通过注意机制;(2)我们采用了两阶段的训练制度,首先通过自我监督的学习预先培训模型,然后进行监督的微调;(3)我们提出了一个简单,可扩展的管道,以合成合成用大量实例或不同语义类别的训练图像明确迫使模型使用给定的“示例”;(4)我们对大规模计数基准的彻底消融研究,例如FSC-147,并在零和少数设置上展示了最先进的性能。
translated by 谷歌翻译
这项工作旨在使用带有动作查询的编码器框架(类似于DETR)来推进时间动作检测(TAD),该框架在对象检测中表现出了巨大的成功。但是,如果直接应用于TAD,该框架遇到了几个问题:解码器中争论之间关系的探索不足,由于培训样本数量有限,分类培训不足以及推断时不可靠的分类得分。为此,我们首先提出了解码器中的关系注意机制,该机制根据其关系来指导查询之间的注意力。此外,我们提出了两项​​损失,以促进和稳定行动分类的培训。最后,我们建议在推理时预测每个动作查询的本地化质量,以区分高质量的查询。所提出的命名React的方法在Thumos14上实现了最新性能,其计算成本比以前的方法低得多。此外,还进行了广泛的消融研究,以验证每个提出的组件的有效性。该代码可在https://github.com/sssste/reaeact上获得。
translated by 谷歌翻译
这项工作的目的是使用零手动注释建立可扩展的管道,以将对象检测器扩展到新颖/看不见的类别。为此,我们做出以下四个贡献:(i)追求概括,我们提出了一个两阶段的开放式摄制对象检测器,其中类无形的对象建议与预先训练的视觉视觉训练的文本编码一起分类语言模型; (ii)要将视觉潜在空间(RPN框建议)与预训练的文本编码器配对,我们提出了区域提示的概念,以学习将文本嵌入空间与区域视觉对象特征相结合; (iii)为了扩展学习过程以检测更广泛的对象,我们通过新颖的自我训练框架利用可用的在线资源,该框架允许在嘈杂的未经图像的网络图像上训练所提出的检测器。最后,(iv)评估我们所提出的检测器,称为及时插图,我们对具有挑战性的LVI和MS-COCO数据集进行了广泛的实验。提示件表现出优于现有方法的卓越性能,而其他培训图像和零手动注释较少。带代码的项目页面:https://fcjian.github.io/promptdet。
translated by 谷歌翻译
这项工作旨在改善具有自我监督的实例检索。我们发现使用最近开发的自我监督(SSL)学习方法(如SIMCLR和MOCO)的微调未能提高实例检索的性能。在这项工作中,我们确定了例如检索的学习表示应该是不变的视点和背景等的大变化,而当前SSL方法应用的自增强阳性不能为学习强大的实例级别表示提供强大的信号。为了克服这个问题,我们提出了一种在\ texit {实例级别}对比度上建立的新SSL方法,以通过动态挖掘迷你批次和存储库来学习类内不变性训练。广泛的实验表明,insclr在实例检索上实现了比最先进的SSL方法更类似或更好的性能。代码可在https://github.com/zeludeng/insclr获得。
translated by 谷歌翻译
变压器在许多视觉任务上表现出优选的性能。然而,对于人的任务重新识别(Reid),Vanilla变形金刚将丰富的背景留下了高阶特征关系,这是由于行人的戏剧性变化而不足的局部特征细节。在这项工作中,我们提出了一个全部关系高阶变压器(OH-Figrain)来模拟Reid的全系关系功能。首先,为了加强视觉表示的能力,而不是基于每个空间位置的对查询和隔离键获得注意矩阵,我们进一步逐步以模拟非本地机制的高阶统计信息。我们以先前的混合机制在每个订单的相应层中共享注意力,以降低计算成本。然后,提出了一种基于卷积的本地关系感知模块来提取本地关系和2D位置信息。我们模型的实验结果是优越的有前途,其在市场上显示出最先进的性能-1501,Dukemtmc,MSMT17和occluded-Duke数据集。
translated by 谷歌翻译
在NAS领域中,可分构造的架构搜索是普遍存在的,因为它的简单性和效率,其中两个范例,多路径算法和单路径方法主导。多路径框架(例如,DARTS)是直观的,但遭受内存使用和培训崩溃。单路径方法(例如,e.g.gdas和proxylesnnas)减轻了内存问题并缩小了搜索和评估之间的差距,但牺牲了性能。在本文中,我们提出了一种概念上简单的且有效的方法来桥接这两个范式,称为相互意识的子图可差架构搜索(MSG-DAS)。我们框架的核心是一个可分辨动的Gumbel-Topk采样器,它产生多个互斥的单路径子图。为了缓解多个子图形设置所带来的Severer Skip-Connect问题,我们提出了一个Dropblock-Identity模块来稳定优化。为了充分利用可用的型号(超级网和子图),我们介绍了一种记忆高效的超净指导蒸馏,以改善培训。所提出的框架击中了灵活的内存使用和搜索质量之间的平衡。我们展示了我们在想象中和CIFAR10上的方法的有效性,其中搜索的模型显示了与最近的方法相当的性能。
translated by 谷歌翻译
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
translated by 谷歌翻译
Although deep learning has made remarkable progress in processing various types of data such as images, text and speech, they are known to be susceptible to adversarial perturbations: perturbations specifically designed and added to the input to make the target model produce erroneous output. Most of the existing studies on generating adversarial perturbations attempt to perturb the entire input indiscriminately. In this paper, we propose ExploreADV, a general and flexible adversarial attack system that is capable of modeling regional and imperceptible attacks, allowing users to explore various kinds of adversarial examples as needed. We adapt and combine two existing boundary attack methods, DeepFool and Brendel\&Bethge Attack, and propose a mask-constrained adversarial attack system, which generates minimal adversarial perturbations under the pixel-level constraints, namely ``mask-constraints''. We study different ways of generating such mask-constraints considering the variance and importance of the input features, and show that our adversarial attack system offers users good flexibility to focus on sub-regions of inputs, explore imperceptible perturbations and understand the vulnerability of pixels/regions to adversarial attacks. We demonstrate our system to be effective based on extensive experiments and user study.
translated by 谷歌翻译
Recently the deep learning has shown its advantage in representation learning and clustering for time series data. Despite the considerable progress, the existing deep time series clustering approaches mostly seek to train the deep neural network by some instance reconstruction based or cluster distribution based objective, which, however, lack the ability to exploit the sample-wise (or augmentation-wise) contrastive information or even the higher-level (e.g., cluster-level) contrastiveness for learning discriminative and clustering-friendly representations. In light of this, this paper presents a deep temporal contrastive clustering (DTCC) approach, which for the first time, to our knowledge, incorporates the contrastive learning paradigm into the deep time series clustering research. Specifically, with two parallel views generated from the original time series and their augmentations, we utilize two identical auto-encoders to learn the corresponding representations, and in the meantime perform the cluster distribution learning by incorporating a k-means objective. Further, two levels of contrastive learning are simultaneously enforced to capture the instance-level and cluster-level contrastive information, respectively. With the reconstruction loss of the auto-encoder, the cluster distribution loss, and the two levels of contrastive losses jointly optimized, the network architecture is trained in a self-supervised manner and the clustering result can thereby be obtained. Experiments on a variety of time series datasets demonstrate the superiority of our DTCC approach over the state-of-the-art.
translated by 谷歌翻译