Recent research has reported a performance degradation in self-supervised contrastive learning for specially designed efficient networks, such as MobileNet and EfficientNet. A common practice to address this problem is to introduce a pretrained contrastive teacher model and train the lightweight networks with distillation signals generated by the teacher. However, pretraining a teacher model is time- and resource-consuming when one is not already available. In this work, we aim to establish a stronger baseline for lightweight contrastive models without using a pretrained teacher model. Specifically, we show that the optimal recipe for efficient models differs from that of larger models, and that reusing the training settings of ResNet50, as previous research does, is inappropriate. Additionally, we observe a common issue in contrastive learning where either the positive or negative views can be noisy, and propose a smoothed version of the InfoNCE loss to alleviate this problem. As a result, we improve the linear evaluation results from 36.3\% to 62.3\% for MobileNet-V3-Large and from 42.2\% to 65.8\% for EfficientNet-B0 on ImageNet, closing the accuracy gap to ResNet50 with $5\times$ fewer parameters. We hope our research will facilitate the usage of lightweight contrastive models.
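The abstract does not give the form of the smoothed InfoNCE loss, so the following is only a minimal sketch of one plausible reading: label smoothing applied to the one-hot target of InfoNCE. The function names, the `alpha` smoothing weight, and the default `temperature` are assumptions made for illustration, not the authors' formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_info_nce(sim_pos, sim_negs, temperature=0.2, alpha=0.1):
    """Cross-entropy over [positive, negatives] with a smoothed target.

    Assumed form (hypothetical): the target puts 1 - alpha on the positive
    and spreads alpha evenly over the negatives. alpha = 0 recovers the
    standard InfoNCE loss for a single anchor.
    """
    if not sim_negs:
        raise ValueError("at least one negative similarity is required")
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    log_probs = log_softmax(logits)
    k = len(sim_negs)
    targets = [1.0 - alpha] + [alpha / k] * k
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Example: one anchor with one positive and two negatives.
standard = smoothed_info_nce(1.0, [0.0, 0.0], temperature=1.0, alpha=0.0)
smoothed = smoothed_info_nce(1.0, [0.0, 0.0], temperature=1.0, alpha=0.1)
```

Under this assumed form, a softer target reduces the penalty for assigning some probability to a negative, which is one way to tolerate a noisy positive or a false negative.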
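Owing to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive work has shown that deep neural networks (DNNs) are susceptible to adversarial examples, and exploring adversarial attacks against deep hashing has attracted many research efforts. Nevertheless, the backdoor attack, another well-known threat to DNNs, has not yet been studied for deep hashing. Although various backdoor attacks have been proposed in the field of image classification, existing approaches fail to realize a truly imperceptible backdoor attack that enjoys invisible triggers and a clean-label setting simultaneously, and they also cannot meet the intrinsic demands of an image retrieval backdoor. In this paper, we propose Badhash, the first generative-based imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean labels. Specifically, we first propose a new conditional generative adversarial network (CGAN) pipeline to effectively generate poisoned samples. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. To improve attack effectiveness, we introduce a label-based contrastive learning network, LabCln, to exploit the semantic characteristics of different labels, which are subsequently used to confuse and mislead the target model into learning the embedded trigger. Finally, we explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that Badhash can generate imperceptible poisoned samples with strong attack ability and transferability over state-of-the-art deep hashing schemes. Primary Subject Area: [Engagement] Multimedia Search and Recommendation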
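A backdoored deep learning (DL) model behaves normally on clean inputs but misbehaves on trigger inputs as the backdoor attacker desires, posing severe consequences for DL model deployments. State-of-the-art defenses are either limited to specific backdoor attacks (source-agnostic attacks) or not user-friendly, in that they require machine learning (ML) expertise or expensive computing resources. This work observes that all existing backdoor attacks have an inevitable intrinsic weakness, non-transferability: a trigger input hijacks a backdoored model but is not effective against another model that has not been implanted with the same backdoor. With this key observation, we propose non-transferability enabled backdoor detection (NTD) to identify trigger inputs for a model-under-test (MUT) at run-time. Specifically, NTD lets a potentially backdoored MUT predict a class for an input. Meanwhile, NTD leverages a feature extractor (FE) to extract feature vectors for the input and for a group of samples randomly picked from its predicted class, and then compares the similarity between the input and the samples in the FE's latent space. If the similarity is low, the input is an adversarial trigger input; otherwise, it is benign. The FE is a free pre-trained model privately reserved from open platforms. Because the FE and the MUT come from different sources, the attacker is very unlikely to insert the same backdoor into both of them. Because of non-transferability, a trigger effect that works on the MUT cannot be transferred to the FE, making NTD effective against different types of backdoor attacks. We evaluate NTD on three popular customized tasks, namely face recognition, traffic sign recognition, and general animal classification; the results confirm that NTD has high effectiveness (low false acceptance rate) and usability (low false alarm rate) with low detection latency.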
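The run-time check described above (predict with the MUT, embed with the FE, compare against samples of the predicted class) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: cosine similarity, averaging over the picked samples, the `threshold` value, and all function and parameter names are hypothetical choices not specified in the abstract.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_trigger_input(x, mut_predict, fe_extract, class_samples,
                     threshold=0.5, num_samples=5, rng=None):
    """Flag x as a trigger input when it looks unlike its predicted class.

    mut_predict:   callable, the model-under-test; returns a class label.
    fe_extract:    callable, the independent feature extractor.
    class_samples: dict mapping class label -> list of held-out clean samples.
    threshold:     assumed decision threshold on the mean cosine similarity.
    """
    rng = rng or random.Random(0)
    predicted = mut_predict(x)
    pool = class_samples[predicted]
    picked = rng.sample(pool, min(num_samples, len(pool)))
    if not picked:
        raise ValueError("no reference samples for the predicted class")
    feat_x = fe_extract(x)
    sims = [cosine_similarity(feat_x, fe_extract(s)) for s in picked]
    mean_sim = sum(sims) / len(sims)
    return mean_sim < threshold


# Toy usage: the "MUT" always predicts "cat"; the "FE" is the identity map.
samples = {"cat": [[1.0, 0.0], [0.9, 0.1]]}
flagged = is_trigger_input([0.0, 1.0], lambda x: "cat", lambda x: x, samples)
```

In practice the threshold would presumably need to be calibrated on clean data per task, since it trades false acceptance against false alarms.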
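Although deep neural network (DNN) models exhibit outstanding performance in various applications, their large model size and extensive floating-point operations make deployment on mobile computing platforms, and on Internet of Things (IoT) devices in particular, a major challenge. One appealing solution is model quantization, which reduces the model size and uses integer operations commonly supported by microcontrollers. To this end, a 1-bit quantized DNN model, or deep binary neural network (BNN), maximizes memory efficiency, since each parameter in a BNN model has only 1 bit. In this paper, we propose a reconfigurable BNN (RBNN) to further amplify the memory efficiency of resource-constrained IoT devices. In general, an RBNN can be reconfigured on demand to perform any one of M (M > 1) distinct tasks with the same parameter set, so only a single task determines the memory requirement. In other words, memory utilization is improved by a factor of M. Our extensive experiments confirm that up to seven commonly used tasks can co-exist (the value of M can be larger). These tasks, with varying numbers of classes, show no or negligible accuracy drop on three binarized popular DNN architectures, including VGG, ResNet, and ReactNet. The tasks span different domains, such as the computer vision and audio domains validated in this paper, with the prerequisite that the model architecture can serve these cross-domain tasks. To protect the intellectual property of an RBNN model, reconfiguration can be controlled by both a user key and a device-unique root key generated from the intrinsic hardware fingerprint. As a result, an RBNN model can only be used per paid user per authorized device, benefiting both the user and the model provider.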
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, or not at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over MIM pre-training from scratch on ImageNet-1K classification for the ViT-Tiny, ViT-Small, and ViT-Base models, with gains of +4.2%/+2.4%/+1.4%, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
Benefiting from its capability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from improving further. 1) The quality of positive samples depends heavily on carefully designed data augmentations, and inappropriate augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) by mining the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls samples from the same cluster together and pushes those from other clusters apart by maximizing the cross-view cosine similarity between positive samples and minimizing it between negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
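The objective described in the last step can be illustrated with a small sketch. The abstract does not state the exact loss, so the form below (mean of one minus cosine similarity over positive pairs, plus mean cosine similarity over pairs of cluster centers) is an assumption chosen for simplicity, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_guided_loss(pos_pairs, centers):
    """Assumed loss: pull positives together, push cluster centers apart.

    pos_pairs: list of (embedding_view1, embedding_view2) pairs drawn from
               the same high-confidence cluster.
    centers:   list of high-confidence cluster centers (at least two),
               treated as mutual negatives.
    """
    if not pos_pairs or len(centers) < 2:
        raise ValueError("need at least one positive pair and two centers")
    # Minimizing (1 - cos) maximizes similarity between positive pairs.
    pos_term = sum(1.0 - cosine(a, b) for a, b in pos_pairs) / len(pos_pairs)
    # Minimizing cos between centers pushes different clusters apart.
    neg_pairs = list(combinations(centers, 2))
    neg_term = sum(cosine(c1, c2) for c1, c2 in neg_pairs) / len(neg_pairs)
    return pos_term + neg_term


# Aligned positives and orthogonal centers give the lowest loss in this toy case.
best = cluster_guided_loss([([1.0, 0.0], [1.0, 0.0])], [[1.0, 0.0], [0.0, 1.0]])
worst = cluster_guided_loss([([1.0, 0.0], [0.0, 1.0])], [[1.0, 0.0], [1.0, 0.0]])
```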
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, the environment returns, as the coefficient to build an importance map. We also conduct experiments to investigate three major questions concerning the equality of frames' importance, the effectiveness of the importance map, and the connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames in the demonstrations.
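The mask-retrain-evaluate loop can be sketched in the style of RISE-like perturbation methods. The code below is a rough illustration only: `train_and_evaluate` is a hypothetical callback standing in for retraining the IL model on the kept frames and returning the environment return, and the normalization by per-frame keep counts is an assumption, not a detail taken from the abstract.

```python
import random


def frame_importance(num_frames, train_and_evaluate,
                     num_masks=200, keep_prob=0.5, seed=0):
    """Estimate per-frame importance from returns under random frame masks.

    train_and_evaluate: hypothetical callable that takes the list of kept
    frame indices, retrains the black-box IL model on those frames, and
    returns the resulting environment return (a float).
    """
    rng = random.Random(seed)
    weighted = [0.0] * num_frames  # sum of returns over masks keeping frame i
    counts = [0] * num_frames      # number of masks keeping frame i
    for _ in range(num_masks):
        mask = [rng.random() < keep_prob for _ in range(num_frames)]
        kept = [i for i, keep in enumerate(mask) if keep]
        ret = train_and_evaluate(kept)
        for i in kept:
            weighted[i] += ret
            counts[i] += 1
    # Average return over the masks in which each frame was kept.
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


# Toy evaluator: the "policy" only succeeds when frame 2 is in the data.
toy_returns = lambda kept: 1.0 if 2 in kept else 0.0
importance = frame_importance(5, toy_returns)
```

In the toy case, frame 2 receives the highest score because every mask that keeps it yields the full return, while the other frames are only credited when frame 2 happens to be kept alongside them.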
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.