Adversarial robustness assessment for video recognition models has raised concerns owing to their wide application to safety-critical tasks. Compared with images, videos have much higher dimensionality, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks, where gradient estimation for the threat model is usually utilized and high dimensionality leads to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve effective and efficient gradient estimation on a reduced search space, thus decreasing the number of queries. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on key frames and key regions simultaneously selected from the inter-frame and intra-frame structure of the video. The AstFocus attack is based on a cooperative Multi-Agent Reinforcement Learning (MARL) framework: one agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained with a common reward received from the black-box threat model to make cooperative predictions. Through continuous querying, the reduced search space composed of key frames and key regions becomes more precise, and the total number of queries becomes smaller than that required on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, being superior in fooling rate, query number, time, and perturbation magnitude at the same time.
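To make the search-space reduction concrete, here is a minimal sketch of NES-style gradient estimation restricted to the focused subspace; `frame_agent` and `region_agent` stand in for the two MARL policies, and their training loop and the exact estimator are assumptions, not the paper's algorithm:

```python
import torch

def estimate_gradient_on_focus(video, frame_agent, region_agent, loss_fn,
                               sigma=1e-3, n_samples=10):
    # video: (T, C, H, W). frame_agent/region_agent stand in for the two
    # MARL policies; here they just return boolean masks (assumption).
    T, C, H, W = video.shape
    frame_mask = frame_agent(video).view(T, 1, 1, 1).float()    # key frames
    region_mask = region_agent(video).view(1, 1, H, W).float()  # key regions
    mask = frame_mask * region_mask          # focused subspace of the video

    grad = torch.zeros_like(video)
    for _ in range(n_samples):
        noise = torch.randn_like(video) * mask   # perturb only the focus
        l_pos = loss_fn(video + sigma * noise)   # one query each
        l_neg = loss_fn(video - sigma * noise)
        grad += (l_pos - l_neg) / (2 * sigma) * noise
    return grad / n_samples                  # antithetic NES-style estimate
```

Because `mask` zeroes out non-key frames and regions, the effective dimension of the estimate, and hence the query budget needed for a usable gradient, shrinks with the focus.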
Adversarial patches are an important form of real-world adversarial attack that pose serious risks to the robustness of deep neural networks. Previous methods generate adversarial patches by either optimizing their perturbation values while fixing the pasting position or manipulating the position while fixing the patch's content, which indicates that both the position and the perturbation matter to the adversarial attack. Therefore, in this paper, we propose a novel method to simultaneously optimize the position and perturbation of an adversarial patch, and thus obtain a high attack success rate in the black-box setting. Technically, we regard the patch's position and the pre-designed hyper-parameters that determine the patch's perturbations as the variables, and utilize a reinforcement learning framework to simultaneously solve for the optimal solution based on the rewards obtained from the target model with a small number of queries. Extensive experiments are conducted on the Face Recognition (FR) task, and results on four representative FR models show that our method can significantly improve the attack success rate and query efficiency. Besides, experiments on a commercial FR service and in physical environments confirm its practical application value. We also extend our method to the traffic sign recognition task to verify its generalization ability.
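A hedged sketch of the query loop, assuming a REINFORCE-style update; `policy`, `render_patch`, and `paste` are hypothetical components, since the abstract does not fix their forms:

```python
import torch

def rl_patch_step(policy, image, render_patch, paste, target_model, optimizer):
    # One REINFORCE step jointly choosing the patch position and the
    # hyper-parameters that determine its perturbation (all components here
    # are assumptions about the interface, not the paper's exact design).
    pos_dist, param_dist = policy(image)      # distributions over pos / params
    pos, params = pos_dist.sample(), param_dist.sample()
    patched = paste(image, render_patch(params), pos)
    with torch.no_grad():
        # One query to the black-box target; a lower true-class score
        # yields a higher reward (one plausible reward shaping).
        reward = 1.0 - target_model(patched)
    logp = pos_dist.log_prob(pos) + param_dist.log_prob(params).sum()
    loss = -(logp * reward).mean()            # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```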
The input and output of most text generation tasks can be transformed into two sequences of tokens, which can be modeled using sequence-to-sequence learning tools such as Transformers. These models are usually trained by maximizing the likelihood of the output text sequence, assuming the input sequence and all gold preceding tokens are given during training; during inference, however, the model suffers from the exposure bias problem (i.e., during beam search it only has access to its previously predicted tokens rather than gold tokens). In this paper, we propose MoCa ({\bf Mo}mentum {\bf Ca}libration) for text generation. MoCa is an online method that dynamically generates slowly evolving (but consistent) samples using a momentum moving-average generator with beam search, and learns to align the model scores of these samples with their actual qualities. Experiments on four text generation datasets (i.e., CNN/DailyMail, XSum, SAMSum and Gigaword) show that MoCa consistently improves strong pre-trained Transformers fine-tuned in the vanilla way, and we achieve state-of-the-art results on the CNN/DailyMail and SAMSum datasets.
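A minimal sketch of the two ingredients named above, assuming the generator is an exponential moving average of the online model and that calibration takes a pairwise ranking form over beam-search samples (the exact loss in MoCa may differ):

```python
import torch

@torch.no_grad()
def momentum_update(online_model, momentum_model, m=0.999):
    # Slowly evolving generator: EMA of the online model's weights.
    for p_o, p_m in zip(online_model.parameters(), momentum_model.parameters()):
        p_m.mul_(m).add_(p_o, alpha=1 - m)

def calibration_loss(scores, qualities, margin=0.01):
    # scores, qualities: (K,) tensors for K beam-search samples of one input;
    # push the model to score higher-quality samples higher (pairwise hinge,
    # an assumed instantiation of "aligning scores with actual qualities").
    loss, K = scores.new_zeros(()), scores.size(0)
    for i in range(K):
        for j in range(K):
            if qualities[i] > qualities[j]:
                loss = loss + torch.clamp(margin - (scores[i] - scores[j]), min=0)
    return loss / (K * (K - 1))
```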
Prompts with different control signals (e.g., length, keywords, etc.) can be used to control text summarization. When control signals are available, they can control the properties of generated summaries and potentially improve summarization quality (since more information is given). Unfortunately, control signals are not available during inference time. In this paper, we propose Lotus (shorthand for Latent Prompt Tuning for Summarization), a single model that can be applied in both controlled and uncontrolled (without control signals) modes. During training, Lotus learns latent prompt representations from prompts with gold control signals using a contrastive learning objective. Experiments show that Lotus in uncontrolled mode consistently improves upon strong (uncontrollable) summarization models across four different summarization datasets. We also demonstrate that generated summaries can be controlled using prompts with user-specified control tokens.
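A minimal sketch of the contrastive objective as we read the abstract, assuming in-batch negatives and an InfoNCE form; the temperature and batch construction are assumptions:

```python
import torch
import torch.nn.functional as F

def latent_prompt_contrastive_loss(latent_prompts, gold_prompts, tau=0.1):
    # latent_prompts: (B, D) prompt representations predicted without control
    # signals; gold_prompts: (B, D) representations of prompts carrying the
    # gold control signals. Pull matched pairs together, push others apart.
    z = F.normalize(latent_prompts, dim=-1)
    g = F.normalize(gold_prompts, dim=-1)
    logits = z @ g.t() / tau                  # (B, B) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)   # InfoNCE over in-batch negatives
```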
Integrating multispectral data in object detection, especially visible and infrared images, has received increasing attention in recent years. Since visible (RGB) and infrared (IR) images can provide complementary information to handle light variations, paired images are used in many fields, such as multispectral pedestrian detection, RGB-IR crowd counting, and RGB-IR salient object detection. Compared with natural RGB-IR images, we find that detection in aerial RGB-IR images suffers from a cross-modal weak misalignment problem, which manifests as position, size, and angle deviations of the same object across modalities. In this paper, we mainly address the challenge of cross-modal weak misalignment in aerial RGB-IR images. Specifically, we first explain and analyze the causes of the weak misalignment problem. Then, we propose a Translation-Scale-Rotation Alignment (TSRA) module to address the problem by calibrating the feature maps of the two modalities. The module predicts the deviation between objects of the two modalities through an alignment process and employs a Modality-Selection (MS) strategy to improve alignment performance. Finally, a Two-Stream Feature Alignment Detector (TSFADet) based on the TSRA module is constructed for RGB-IR object detection in aerial images. Through comprehensive experiments on a public drone dataset, we verify that our method reduces the effect of cross-modal misalignment and achieves robust detection results.
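A minimal sketch of the calibration idea, assuming the module predicts a per-image rigid transform (translation, scale, rotation) and warps the IR feature map to align with the RGB branch; the interface and prediction head are our assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def tsra_align(feat_ir, offsets):
    # feat_ir: (B, C, H, W) IR feature map; offsets: (B, 4) predicted
    # deviations (dx, dy, ds, dtheta) between the two modalities (assumed
    # interface). Apply the rigid transform as an affine warp.
    dx, dy, ds, dth = offsets.unbind(-1)
    cos = torch.cos(dth) * (1 + ds)
    sin = torch.sin(dth) * (1 + ds)
    theta = torch.stack([
        torch.stack([cos, -sin, dx], dim=-1),
        torch.stack([sin,  cos, dy], dim=-1),
    ], dim=1)                                          # (B, 2, 3) affine matrices
    grid = F.affine_grid(theta, feat_ir.shape, align_corners=False)
    return F.grid_sample(feat_ir, grid, align_corners=False)
```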
Fast adversarial training (FAT) effectively improves the efficiency of standard adversarial training (SAT). However, initial FAT encounters catastrophic overfitting, i.e., the robust accuracy against adversarial attacks suddenly and drastically decreases. Although several FAT variants spare no effort to prevent overfitting, they sacrifice much computational cost. In this paper, we explore the difference between the training processes of SAT and FAT and observe that the attack success rate of adversarial examples (AEs) in FAT gradually becomes worse in the late training stage, resulting in overfitting. The AEs are generated by the fast gradient sign method (FGSM) with zero or random initialization. Based on this observation, and after investigating several initialization strategies, we propose a prior-guided FGSM initialization method to avoid overfitting, which improves the quality of the AEs during the whole training process. The initialization is formed by leveraging historically generated AEs without additional computational cost. We further provide a theoretical analysis for the proposed initialization method. We also propose a simple yet effective regularizer based on the prior-guided initialization, i.e., the currently generated perturbation should not deviate too much from the prior-guided initialization. The regularizer adopts both historical and current adversarial perturbations to guide the model learning. Evaluations on four datasets demonstrate that the proposed method can prevent catastrophic overfitting and outperform state-of-the-art FAT methods. The code is released at https://github.com/jiaxiaojunqaq/fgsm-pgi.
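A minimal sketch of the prior-guided idea, assuming the perturbation from the previous epoch is cached per sample and reused both as the FGSM initialization and as the anchor of the regularizer; the loss weighting and update details are assumptions (the released code has the authors' version):

```python
import torch
import torch.nn.functional as F

def fgsm_prior_init_step(model, x, y, prior_delta, eps, alpha, lam=1.0):
    # Initialize FGSM from the perturbation cached for these samples in the
    # previous epoch instead of zero/random noise (no extra queries needed).
    delta = prior_delta.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = torch.clamp(delta + alpha * grad.sign(), -eps, eps).detach()

    adv_loss = F.cross_entropy(model(x + delta), y)
    reg = ((delta - prior_delta) ** 2).mean()   # stay close to the prior
    return adv_loss + lam * reg, delta          # delta becomes the next prior
```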
Contrastive learning models have achieved great success in unsupervised visual representation learning, where they maximize the similarity between feature representations of different views of the same image while minimizing the similarity between views of different images. In text summarization, the output summary is a shorter form of the input document and they share similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary, and its model-generated summaries as different views of the same meaning representation, and maximize the similarity between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings compared with its counterpart without the contrastive objective.
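A minimal sketch of the training objective as we read it, assuming the document, gold summary, and generated summary are pooled into fixed-size vectors; the pairwise cosine form is an assumption:

```python
import torch.nn.functional as F

def view_similarity_loss(doc_repr, gold_repr, gen_repr):
    # Treat the document, its gold summary, and a model-generated summary as
    # three views of the same meaning and maximize their pairwise cosine
    # similarities (pooling into fixed vectors is assumed; the paper builds
    # on BART encoders).
    def sim(a, b):
        return F.cosine_similarity(a, b, dim=-1).mean()
    return -(sim(doc_repr, gold_repr)
             + sim(doc_repr, gen_repr)
             + sim(gold_repr, gen_repr)) / 3
```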
To assess the vulnerability of deep learning in the physical world, recent works introduce adversarial patches and apply them to different tasks. In this paper, we propose another kind of adversarial patch: the Meaningful Adversarial Sticker, a physically feasible and stealthy attack method that uses real stickers existing in our life. Unlike previous adversarial patches that design perturbations, our method manipulates the sticker's pasting position and rotation angle on the object to perform physical attacks. Because the position and rotation angle are less affected by printing loss and color distortion, adversarial stickers can maintain good attack performance in the physical world. Besides, to make adversarial stickers more practical in real scenes, we conduct attacks in the black-box setting with limited information rather than the white-box setting with full details of the threat model. To effectively solve for the sticker's parameters, we design the Region-based Heuristic Differential Evolution algorithm, which utilizes the newly found regional aggregation of effective solutions and an adaptive adjustment strategy for the evaluation criteria. Our method is comprehensively verified on face recognition and then extended to image retrieval and traffic sign recognition. Extensive experiments show that the proposed method is effective and efficient in complex physical conditions and generalizes well across different tasks.
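For orientation, here is a plain differential-evolution loop over the sticker parameters (e.g., position and rotation angle); the paper's regional-aggregation heuristic and adaptive evaluation criteria are omitted, so this is only a baseline sketch:

```python
import numpy as np

def differential_evolution_attack(fitness, bounds, pop=20, iters=50,
                                  F_=0.5, CR=0.7):
    # fitness: queries the black-box model on a candidate (x, y, angle) and
    # returns a score to minimize (e.g., the target identity's confidence).
    # bounds: (D, 2) array of per-parameter lower/upper limits.
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = lo + np.random.rand(pop, len(lo)) * (hi - lo)
    f = np.array([fitness(x) for x in X])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = X[np.random.choice(pop, 3, replace=False)]
            mutant = np.clip(a + F_ * (b - c), lo, hi)     # DE/rand/1 mutation
            cross = np.where(np.random.rand(len(lo)) < CR, mutant, X[i])
            fc = fitness(cross)
            if fc < f[i]:                                  # greedy selection
                X[i], f[i] = cross, fc
    return X[f.argmin()], f.min()
```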
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot, or only marginally, benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over scratch MIM pre-training on ImageNet-1K classification, with +4.2%/+2.4%/+1.4% gains for the ViT-Tiny, ViT-Small, and ViT-Base models, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, setting a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models: exploring better training methods rather than introducing inductive biases into architectures, as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
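A minimal sketch of token-relation distillation under our assumptions: matching softmax-normalized query-key relation maps between a student block and an intermediate teacher block with a KL loss. The paper studies several relation choices, so this is one possible instantiation:

```python
import torch.nn.functional as F

def token_relation_distill_loss(q_s, k_s, q_t, k_t, tau=1.0):
    # q, k: (B, heads, N, d) query/key projections taken from one student
    # block and one intermediate teacher block. Match the relations between
    # tokens rather than the token features themselves.
    d_s, d_t = q_s.size(-1) ** 0.5, q_t.size(-1) ** 0.5
    rel_s = F.log_softmax(q_s @ k_s.transpose(-2, -1) / (tau * d_s), dim=-1)
    rel_t = F.softmax(q_t @ k_t.transpose(-2, -1) / (tau * d_t), dim=-1)
    return F.kl_div(rel_s, rel_t, reduction="batchmean")
```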
Given the increasingly intricate forms of partial differential equations (PDEs) in physics and related fields, computationally solving PDEs without analytic solutions inevitably suffers from a trade-off between accuracy and efficiency. Recent advances in neural operators, a kind of mesh-independent neural-network-based PDE solver, suggest the dawn of overcoming this challenge. In this emerging direction, the Koopman neural operator (KNO) is a representative demonstration that outperforms other state-of-the-art alternatives in terms of accuracy and efficiency. Here we present KoopmanLab, a self-contained and user-friendly PyTorch module of the Koopman neural operator family for solving partial differential equations. Beyond the original version of KNO, we develop multiple new variants of KNO based on different neural network architectures to improve the general applicability of our module. These variants are validated by mesh-independent and long-term prediction experiments implemented on representative PDEs (e.g., the Navier-Stokes equation and the Bateman-Burgers equation) and ERA5 (i.e., one of the largest high-resolution data sets of global-scale climate fields). These demonstrations suggest the potential of KoopmanLab to be considered in diverse applications of partial differential equations.
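For readers new to the underlying idea, the Koopman operator trades nonlinear dynamics for linear dynamics on observables, and a KNO-style model learns a finite-dimensional approximation of it. In our notation (not the module's API):

```latex
\[
  u_{t+1} = \Phi(u_t), \qquad (\mathcal{K} g)(u) = g\big(\Phi(u)\big),
\]
% The nonlinear flow map \Phi acts linearly through \mathcal{K} on
% observables g. A KNO-style model learns an encoder g_\theta and a
% finite matrix K such that
\[
  g_\theta(u_{t+1}) \approx K \, g_\theta(u_t),
\]
% advancing the PDE state linearly in the learned observable space.
```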