Recently, action recognition has received increasing attention due to its broad and practical applications in intelligent surveillance and human-computer interaction. However, few-shot action recognition remains under-explored and challenging because of data scarcity. In this paper, we propose a novel hierarchical compositional representation (HCR) learning approach for few-shot action recognition. Specifically, we divide a complex action into several sub-actions by carefully designed hierarchical clustering, and further decompose the sub-actions into more fine-grained spatially attentional sub-actions (SAS-actions). Although there is a large gap between base classes and novel classes, they can share similar patterns in sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of sub-action representations. It computes the optimal matching flows between sub-actions as the distance metric, which is favorable for comparing fine-grained patterns. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51, UCF101 and Kinetics datasets.
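To make the sub-action matching concrete, below is a minimal NumPy sketch of how an Earth Mover's Distance between two videos, each represented as a set of sub-action embeddings, can be computed; entropy-regularized Sinkhorn iterations stand in for an exact transportation solver, and the uniform marginal weights and cosine cost are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def sinkhorn_flow(cost, r, c, reg=0.05, n_iters=200):
    """Approximate the optimal transport plan for a cost matrix with marginals r, c."""
    K = np.exp(-cost / reg)
    u = np.ones_like(r)
    for _ in range(n_iters):
        v = c / (K.T @ u + 1e-12)
        u = r / (K @ v + 1e-12)
    return u[:, None] * K * v[None, :]           # transport plan (matching flows)

def emd_video_distance(sub_a, sub_b):
    """sub_a, sub_b: (K, D) arrays of K sub-action embeddings per video."""
    a = sub_a / (np.linalg.norm(sub_a, axis=1, keepdims=True) + 1e-12)
    b = sub_b / (np.linalg.norm(sub_b, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - a @ b.T                          # cosine distance between sub-actions
    r = np.full(len(sub_a), 1.0 / len(sub_a))     # assumed uniform sub-action weights
    c = np.full(len(sub_b), 1.0 / len(sub_b))
    flow = sinkhorn_flow(cost, r, c)
    return float((flow * cost).sum())             # EMD = total transport cost

# usage: distance between two videos, each summarized by 5 sub-action embeddings
d = emd_video_distance(np.random.rand(5, 128), np.random.rand(5, 128))
```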
Dynamic facial expression recognition (FER) databases provide important data support for affective computing and its applications. However, most FER databases are annotated with several basic, mutually exclusive emotional categories and contain only one modality, e.g., videos. The monotonous labels and modality cannot accurately mimic human emotions or fulfill real-world applications. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of 11 widely used emotions: anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high-quality labels, we filter out unreliable annotations with an Expectation Maximization (EM) algorithm, and then obtain 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database annotated with compound emotions and emotion-related captions. In addition, we propose a novel Transformer-based expression snippet feature learning method to recognize compound emotions by leveraging the expression-change relations among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni-modal and multi-modal FER. Our MAFW database is publicly available at https://mafw-database.github.io/mafw.
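As an illustration of how an EM procedure can down-weight unreliable annotations, here is a small NumPy sketch of a per-annotator reliability model (a simplified Dawid-Skene-style variant); the generative model, variable names, and the 11-class setting are illustrative assumptions, not the exact algorithm used to build MAFW.

```python
import numpy as np

def em_label_aggregation(labels, n_classes=11, n_iters=50):
    """labels: (n_clips, n_annotators) int array of emotion ids, -1 where an annotator skipped.
    Returns a posterior over true labels per clip and an estimated reliability per annotator."""
    n_clips, n_annot = labels.shape
    reliability = np.full(n_annot, 0.8)           # initial guess: annotators are mostly right
    prior = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iters):
        # E-step: posterior over the latent true label of each clip
        log_post = np.log(prior)[None, :].repeat(n_clips, axis=0)
        for a in range(n_annot):
            obs = labels[:, a]
            valid = obs >= 0
            p_match = reliability[a]
            p_other = (1.0 - reliability[a]) / (n_classes - 1)
            lik = np.full((n_clips, n_classes), p_other)
            lik[np.arange(n_clips)[valid], obs[valid]] = p_match
            log_post[valid] += np.log(lik[valid])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the class prior and each annotator's reliability
        prior = post.mean(axis=0)
        for a in range(n_annot):
            obs, valid = labels[:, a], labels[:, a] >= 0
            agree = post[np.arange(n_clips)[valid], obs[valid]].sum()
            reliability[a] = np.clip(agree / max(valid.sum(), 1), 1e-3, 1 - 1e-3)
    return post, reliability
```

Annotations from annotators whose estimated reliability is low, or clips whose posterior stays too flat, can then be flagged and filtered out.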
This report briefly describes our winning solution to the AVA Active Speaker Detection (ASD) task at the ActivityNet Challenge 2022. Our underlying model, UniCon+, continues to build on our previous work, the Unified Context Network (UniCon) and Extended UniCon, which are designed for robust scene-level ASD. We augment the architecture with a simple GRU-based module that allows information of recurring identities to flow across the scene through read and update operations. We report a best result of 94.47% mAP on the AVA-ActiveSpeaker test set, which continues to rank first on this year's challenge leaderboard and significantly pushes the state of the art.
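A minimal PyTorch sketch of what such a GRU-based identity module might look like: one hidden state is kept per identity in the scene, and each new face feature reads the current state (as context) and then updates it; the feature dimensions, per-identity keying, and how downstream layers consume the context are assumptions, since the report does not spell them out.

```python
import torch
import torch.nn as nn

class IdentityMemory(nn.Module):
    """Keeps one GRU hidden state per identity so information of recurring
    speakers can flow across a scene via read/update operations (sketch)."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim
        self.states = {}                           # identity id -> hidden state

    def read(self, identity):
        # read: current state of this identity (zeros if not seen yet)
        return self.states.get(identity, torch.zeros(1, self.hidden_dim))

    def update(self, identity, face_feat):
        # update: fold the new face feature into the identity's state
        h = self.cell(face_feat, self.read(identity))
        self.states[identity] = h
        return h

# usage: stream face features of a scene through the memory
mem = IdentityMemory()
for identity, feat in [("id_0", torch.randn(1, 256)), ("id_1", torch.randn(1, 256)),
                       ("id_0", torch.randn(1, 256))]:
    context = mem.read(identity)                   # context available before updating
    state = mem.update(identity, feat)
```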
The current popular two-stream, two-stage tracking framework extracts the template and the search region features separately and then performs relation modeling; thus the extracted features lack awareness of the target and have limited target-background discriminability. To tackle the above issue, we propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling by bridging the template-search image pairs with bidirectional information flows. In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance. Since no extra heavy relation modeling module is needed and the implementation is highly parallelized, the proposed tracker runs at a fast speed. To further improve the inference efficiency, an in-network candidate early elimination module is proposed based on the strong similarity prior calculated in the one-stream framework. As a unified framework, OSTrack achieves state-of-the-art performance on multiple benchmarks; in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k, i.e., achieving 73.7% AO, improving the existing best result (SwinTrack) by 4.3%. Besides, our method maintains a good performance-speed trade-off and shows faster convergence. The code and models are available at https://github.com/botaoye/OSTrack.
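As a rough illustration of the in-network candidate elimination idea, the sketch below keeps only the search-region tokens that receive the highest attention from the template tokens after a joint attention layer and discards the rest before later layers; the keep ratio, the aggregation over heads and template tokens, and the tensor shapes are assumptions for illustration, not the exact OSTrack implementation.

```python
import torch

def eliminate_candidates(search_tokens, attn, n_template, keep_ratio=0.7):
    """search_tokens: (B, N_s, C) search-region tokens after a joint attention layer.
    attn: (B, heads, N_t + N_s, N_t + N_s) attention weights of that layer.
    Keeps the top keep_ratio fraction of search tokens, scored by the attention
    they receive from the template tokens (a proxy for target similarity)."""
    B, n_search, C = search_tokens.shape
    # attention from every template token (query) to every search token (key)
    t2s = attn[:, :, :n_template, n_template:]            # (B, heads, N_t, N_s)
    scores = t2s.mean(dim=(1, 2))                         # (B, N_s): average over heads/templates
    n_keep = max(1, int(keep_ratio * n_search))
    topk = scores.topk(n_keep, dim=1).indices             # indices of retained candidates
    idx = topk.unsqueeze(-1).expand(-1, -1, C)
    return torch.gather(search_tokens, 1, idx), topk

# usage with toy shapes: 64 template tokens, 256 search tokens, embedding dim 384
tokens = torch.randn(2, 256, 384)
attn = torch.softmax(torch.randn(2, 12, 320, 320), dim=-1)
kept, kept_idx = eliminate_candidates(tokens, attn, n_template=64)
```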
Adversarial attacks provide a good way to study the robustness of deep learning models. One category of transfer-based black-box attacks utilizes several image transformation operations to improve the transferability of adversarial examples, which is effective but fails to account for the specific characteristics of the input image. In this work, we propose a novel architecture called the Adaptive Image Transformation Learner (AITL), which incorporates different image transformation operations into a unified framework to further improve the transferability of adversarial examples. Unlike the fixed combinational transformations used in existing works, our elaborately designed transformation learner adaptively selects the combination of image transformations that is most effective for the specific input image. Extensive experiments on ImageNet demonstrate that our method significantly improves the attack success rates on both normally trained models and defense models under various settings.
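The following PyTorch sketch shows the general shape of an adaptive transformation selector: a small network scores a pool of candidate image transformations for each input, and the top-scoring ones are composed before the attack gradient is computed; the candidate pool, the selector architecture, and the hard top-k selection are assumptions for illustration, not the exact AITL design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# candidate transformation pool (a small illustrative subset)
def hflip(x):
    return torch.flip(x, dims=[3])

def downscale(x):
    small = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    return F.interpolate(small, size=x.shape[2:], mode="bilinear", align_corners=False)

def add_noise(x):
    return (x + 0.03 * torch.randn_like(x)).clamp(0, 1)

TRANSFORMS = [hflip, downscale, add_noise]

class TransformSelector(nn.Module):
    """Scores each candidate transformation for a given image (sketch)."""
    def __init__(self, n_transforms=len(TRANSFORMS)):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(16, n_transforms))
    def forward(self, x):
        return self.backbone(x)                       # (B, n_transforms) selection scores

def apply_adaptive_transforms(x, selector, k=2):
    scores = selector(x).mean(dim=0)                  # image-conditioned transform scores
    chosen = scores.topk(k).indices.tolist()          # pick the k most promising transforms
    for i in chosen:
        x = TRANSFORMS[i](x)
    return x

# usage: transform an input before computing the attack gradient on a surrogate model
selector = TransformSelector()
x_aug = apply_adaptive_transforms(torch.rand(1, 3, 224, 224), selector)
```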
In recent years, with the rapid development of neural networks, the security of deep learning models has become increasingly prominent, as they are vulnerable to adversarial examples. Almost all existing gradient-based attack methods use the sign function during generation to meet the perturbation budget of the $L_\infty$ norm. However, we find that the sign function may be improper for generating adversarial examples because it modifies the exact gradient direction. We propose to remove the sign function and directly utilize the exact gradient direction, with a scaling factor, to generate adversarial perturbations, which improves the attack success rate of adversarial examples even with fewer perturbations. In addition, considering that the best scaling factor varies across images, we propose an adaptive scaling factor generator that seeks an appropriate scaling factor for each image, avoiding the computational cost of manually searching for it. Our method can be integrated with almost all existing gradient-based attack methods to further improve the attack success rate. Extensive experiments on the CIFAR10 and ImageNet datasets show that our method exhibits higher transferability and outperforms state-of-the-art methods.
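A minimal PyTorch sketch of the core idea: an iterative FGSM-style loop in which sign(g) is replaced by the raw gradient, rescaled per image and projected back into the $L_\infty$ ball. The fixed scaling factor here stands in for the adaptive generator described above, and the loss, step count, and normalization are assumptions for illustration.

```python
import torch

def attack_without_sign(model, x, y, eps=8/255, steps=10, scale=1.0):
    """Iterative L_inf attack that keeps the exact gradient direction instead of sign(g).
    `scale` plays the role of the scaling factor; the paper learns it per image."""
    x_adv = x.clone().detach()
    alpha = eps / steps
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # normalize per image so the largest component has magnitude 1, then rescale;
        # compare with the usual update `alpha * grad.sign()`
        g_max = grad.abs().amax(dim=(1, 2, 3), keepdim=True).clamp_min(1e-12)
        step = alpha * scale * grad / g_max
        x_adv = x_adv.detach() + step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project to eps-ball
    return x_adv.detach()

# usage with any differentiable classifier `model` and a labelled batch (x, y):
# x_adv = attack_without_sign(model, x, y)
```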
Image-level weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years. Most advanced solutions exploit the class activation map (CAM). However, CAMs can hardly serve as the object mask due to the gap between full and weak supervision. In this paper, we propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap. Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation, whose pixel-level labels take the same spatial transformation as the input images during data augmentation. However, this constraint is lost on CAMs trained by image-level supervision. Therefore, we propose consistency regularization on CAMs predicted from variously transformed images to provide self-supervision for network learning. Moreover, we propose a pixel correlation module (PCM), which exploits context appearance information and refines the prediction of the current pixel by its similar neighbors, leading to further improvement in CAM consistency. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate that our method outperforms state-of-the-art methods using the same level of supervision. The code is released online.
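To ground the equivariance idea, here is a small PyTorch sketch of the consistency regularization between the CAM of a transformed image and the transformed CAM of the original image, using a horizontal flip as the spatial transform; the specific transform, network interface, and L1 penalty are assumptions for illustration rather than the full SEAM training objective.

```python
import torch
import torch.nn.functional as F

def equivariant_consistency_loss(net, images):
    """net(images) is assumed to return class activation maps of shape (B, C, H, W).
    Enforces CAM(flip(x)) ~= flip(CAM(x)), i.e. equivariance to the spatial transform."""
    flipped = torch.flip(images, dims=[3])          # A(x): horizontal flip of the input
    cam_orig = net(images)                          # CAM(x)
    cam_flip = net(flipped)                         # CAM(A(x))
    target = torch.flip(cam_orig, dims=[3])         # A(CAM(x))
    return F.l1_loss(cam_flip, target)              # consistency regularization term

# usage: add the term to the usual image-level classification loss
# loss = cls_loss + equivariant_consistency_loss(cam_network, images)
```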
Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples. The unseen classes and the low-data problem make few-shot classification very challenging. Many existing approaches extract features from labeled and unlabeled samples independently; as a result, the features are not discriminative enough. In this work, we propose a novel Cross Attention Network to address the challenging problems in few-shot classification. Firstly, a Cross Attention Module is introduced to deal with the problem of unseen classes. The module generates cross attention maps for each pair of class feature and query sample feature so as to highlight the target object regions, making the extracted features more discriminative. Secondly, a transductive inference algorithm is proposed to alleviate the low-data problem, which iteratively utilizes the unlabeled query set to augment the support set, thereby making the class features more representative. Extensive experiments on two benchmarks show that our method is a simple, effective and computationally efficient framework and outperforms the state of the art.
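A simplified PyTorch sketch of the cross attention idea: a correlation map between a class (support) feature map and a query feature map is turned into spatial attention that reweights both features so target regions are highlighted; the single-head dot-product form, the mean pooling of the correlation map, and the residual reweighting are assumptions, not the exact module in the paper.

```python
import torch
import torch.nn.functional as F

def cross_attention(class_feat, query_feat):
    """class_feat, query_feat: (B, C, H, W) feature maps of a class prototype and a query.
    Returns both features reweighted by cross attention maps (simplified sketch)."""
    B, C, H, W = class_feat.shape
    p = class_feat.flatten(2)                          # (B, C, HW)
    q = query_feat.flatten(2)                          # (B, C, HW)
    corr = torch.bmm(p.transpose(1, 2), q) / C**0.5    # (B, HW_p, HW_q) correlation map
    # spatial attention: how strongly each position correlates with the other feature map
    attn_p = F.softmax(corr.mean(dim=2), dim=1).view(B, 1, H, W)
    attn_q = F.softmax(corr.mean(dim=1), dim=1).view(B, 1, H, W)
    return class_feat * (1 + attn_p), query_feat * (1 + attn_q)

# usage with toy 6x6 feature maps from a shared embedding network
cf, qf = torch.randn(2, 64, 6, 6), torch.randn(2, 64, 6, 6)
cf_att, qf_att = cross_attention(cf, qf)
```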
Rotation-invariant face detection, i.e., detecting faces with arbitrary rotation-in-plane (RIP) angles, is widely required in unconstrained applications but remains a challenging task due to the large variations of face appearance. Most existing methods compromise either speed or accuracy to handle the large RIP variations. To address this problem more efficiently, we propose Progressive Calibration Networks (PCN) to perform rotation-invariant face detection in a coarse-to-fine manner. PCN consists of three stages, each of which not only distinguishes faces from non-faces, but also calibrates the RIP orientation of each face candidate progressively toward upright. By dividing the calibration process into several progressive steps and only predicting coarse orientations in the early stages, PCN achieves precise and fast calibration. By performing binary classification of face versus non-face with gradually decreasing RIP ranges, PCN can accurately detect faces with full 360° RIP angles. This design leads to a real-time rotation-invariant face detector. Experiments on the multi-oriented FDDB and a challenging subset of WIDER FACE containing rotated faces in the wild show that our PCN achieves quite promising performance.
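A schematic sketch of the coarse-to-fine calibration flow, assuming three stage predictors are already trained: stage 1 decides whether the candidate is upside-down (0° vs 180°), stage 2 narrows the orientation to one of {-90°, 0°, 90°}, and stage 3 regresses the remaining fine angle; the predictor interfaces, angle bins, and sign conventions are assumptions based on the description above, not PCN's exact implementation.

```python
import numpy as np

def progressive_calibration(window, stage1, stage2, stage3):
    """window: (H, W, 3) face candidate crop. stage1/2/3 are assumed callables:
    stage1(w) -> P(face is flipped 180°), stage2(w) -> probs over (-90°, 0°, 90°),
    stage3(w) -> residual RIP angle in (-45°, 45°). Returns the calibrated crop
    and the total RIP angle accumulated over the three stages."""
    rip = 0.0
    if stage1(window) > 0.5:                        # coarse: upright vs upside-down
        window, rip = np.rot90(window, 2), rip + 180.0
    coarse = (-90.0, 0.0, 90.0)[int(np.argmax(stage2(window)))]
    if coarse:                                      # medium: undo a +/-90° rotation if needed
        window = np.rot90(window, k=-int(coarse // 90))
        rip += coarse
    rip += stage3(window)                           # fine: residual in-plane angle
    return window, rip % 360.0

# toy usage with dummy predictors standing in for the three small stage CNNs
img = np.zeros((24, 24, 3), dtype=np.uint8)
win, angle = progressive_calibration(img,
                                     stage1=lambda w: 0.1,
                                     stage2=lambda w: [0.1, 0.8, 0.1],
                                     stage3=lambda w: 12.5)
```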
Facial attribute editing aims to manipulate single or multiple attributes of a face image, i.e., to generate a new face with desired attributes while preserving other details. Recently, the generative adversarial net (GAN) and the encoder-decoder architecture are usually incorporated to handle this task with promising results. Based on the encoder-decoder architecture, facial attribute editing is achieved by decoding the latent representation of the given face conditioned on the desired attributes. Some existing methods attempt to establish an attribute-independent latent representation for further attribute editing. However, such an attribute-independent constraint on the latent representation is excessive because it restricts the capacity of the latent representation and may result in information loss, leading to over-smooth and distorted generation. Instead of imposing constraints on the latent representation, in this work we apply an attribute classification constraint to the generated image to just guarantee the correct change of desired attributes, i.e., to "change what you want". Meanwhile, reconstruction learning is introduced to preserve attribute-excluding details, in other words, to "only change what you want". Besides, adversarial learning is employed for visually realistic editing. These three components cooperate with each other, forming an effective framework for high-quality facial attribute editing, referred to as AttGAN. Furthermore, our method is also directly applicable to attribute intensity control and can be naturally extended to attribute style manipulation. Experiments on the CelebA dataset show that our method outperforms the state of the art in realistic attribute editing with facial details well preserved.
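The three cooperating components can be summarized as a generator objective combining an attribute classification term on the edited image, a reconstruction term with the original attributes, and an adversarial term; the PyTorch sketch below shows this combination with assumed network interfaces (G takes an image and a target attribute vector, D returns a realism logit, C returns attribute logits), a non-saturating adversarial form, and illustrative loss weights.

```python
import torch
import torch.nn.functional as F

def attgan_generator_loss(G, D, C, x, attrs_src, attrs_tgt,
                          lambda_rec=100.0, lambda_cls=10.0):
    """x: (B, 3, H, W) input faces; attrs_src/attrs_tgt: (B, n_attrs) binary attribute vectors.
    G(x, a) -> edited image, D(img) -> realism logit, C(img) -> attribute logits (assumed)."""
    x_edit = G(x, attrs_tgt)                        # edit toward the desired attributes
    x_rec = G(x, attrs_src)                         # reconstruction with the original attributes
    # attribute classification constraint: "change what you want"
    loss_cls = F.binary_cross_entropy_with_logits(C(x_edit), attrs_tgt)
    # reconstruction learning preserves attribute-excluding details: "only change what you want"
    loss_rec = F.l1_loss(x_rec, x)
    # adversarial learning for visually realistic editing
    d_out = D(x_edit)
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return loss_adv + lambda_cls * loss_cls + lambda_rec * loss_rec
```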